<a href="https://colab.research.google.com/github/Myrica7/My-website/blob/main/NUS_DATATHON_CHAMPION_TEAM_306.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### The cell below is for you to keep track of the libraries used and install those libraries quickly
##### Ensure that the proper library names are used and the syntax of `%pip install PACKAGE_NAME` is followed

In [None]:
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install seaborn
%pip install scikit-learn
%pip install xgboost
# add commented pip installation lines for packages used as shown above for ease of testing
# the line should be of the format %pip install PACKAGE_NAME



## **DO NOT CHANGE** the filepath variable
##### Instead, create a folder named 'data' in your current working directory and
##### have the .csv file inside that. A relative path *must* be used when loading data into pandas

In [None]:
# Can have as many cells as you want for code
import pandas as pd
filepath = "./data/catA_train.csv"

# the initialised filepath MUST be a relative path to a folder named data that contains the .csv file

df = pd.read_csv(filepath)
df.head()

### **ALL** Code for machine learning and dataset analysis should be entered below.
##### Ensure that your code is clear and readable.
##### Comments and Markdown notes are advised to direct attention to pieces of code you deem useful.

In [None]:
#Import packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
print(df.shape)
print(df.info())

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.duplicated().sum()

1. Data cleaning -- Handle missing data

In [None]:
#Remove the Square Footage and Fiscal Year End columns
df2 = df.drop(['Square Footage', 'Fiscal Year End'], axis=1)
#Drop rows with NA value for LATITUDE and LONGITUDE columns
df2 = df2.dropna(subset=['LATITUDE','LONGITUDE'])
#Include entries from active companies only
df2 = df2[df2['Company Status (Active/Inactive)'] == 'Active']

print(df2.shape)
df2

One-Hot Encoding

In [None]:
#One-hot encode 'Entity Type'; dtype=int yields 0/1 columns directly
df3 = pd.get_dummies(df2, columns=['Entity Type'], dtype=int)
df3.head()

2. EDA

Make plots

In [None]:
#Check correlation
numerical_col = df3[['LATITUDE', 'LONGITUDE', 'SIC Code', '8-Digit SIC Code', 'Year Found',
                     'Is Domestic Ultimate', 'Is Global Ultimate', 'Entity Type_Branch', 'Entity Type_Independent',
                     'Entity Type_Parent', 'Entity Type_Subsidiary', 'Employees (Domestic Ultimate Total)',
                     'Employees (Global Ultimate Total)', 'Sales (Domestic Ultimate Total USD)', 'Sales (Global Ultimate Total USD)']]
numerical_col.corr()

In [None]:
# Create a histogram on a fresh figure
plt.figure()
sns.histplot(numerical_col['Year Found'], bins=30, color='blue')

# Customize the plot
plt.title('Count of Companies Based on Year Founded')
plt.xlabel('Year Found')
plt.ylabel('Number of companies')

# Show the plot
plt.show()

In [None]:
#Create pairplot
sns.pairplot(numerical_col)
plt.show()

In [None]:

sns.heatmap(numerical_col.corr(), annot=True, vmin=-1, vmax=1, annot_kws={"size": 5})
plt.show()

Feature Selection

In [None]:
#Feature Ranking (check correlation between features and target variable)
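
One possible way to fill in this ranking step is to sort the numeric features by the absolute value of their Pearson correlation with the target. The sketch below runs on a small made-up frame; the helper name `rank_features_by_correlation` and the toy column names are hypothetical. In this notebook it could be called on `numerical_col` with `'Sales (Domestic Ultimate Total USD)'` as the target.

```python
import pandas as pd

def rank_features_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return features sorted by absolute Pearson correlation with `target`."""
    # Restrict to numeric columns so text columns do not break corr()
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Toy data for illustration only (not taken from the competition dataset)
toy = pd.DataFrame({
    "employees": [10, 20, 30, 40, 50],
    "year_found": [2001, 1995, 2010, 1988, 2005],
    "sales": [100, 210, 290, 405, 500],
})

ranking = rank_features_by_correlation(toy, "sales")
print(ranking)
```

Pearson correlation only captures linear relationships, so a low rank here does not rule out a feature that a tree-based model could still use.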

EDA - Data manipulation

Convert data types and, where useful, create new features

In [None]:
#Convert "Is Domestic Ultimate" and "Is Global Ultimate" to boolean
df2['Is Domestic Ultimate'] = df2['Is Domestic Ultimate'] == 1
df2['Is Global Ultimate'] = df2['Is Global Ultimate'] == 1
df2.head()
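
As an illustration of a derived feature, a company's age can be computed from the `Year Found` column. This is only a sketch on toy data: the reference year and the `Company Age` column name are assumptions, not part of the original analysis.

```python
import pandas as pd

# Assumption: pick the year the data was collected; 2024 is a placeholder
REFERENCE_YEAR = 2024

# Toy frame for illustration; the real notebook would use df2
toy = pd.DataFrame({"Year Found": [1990, 2015, None]})

# Missing founding years stay missing (NaN) in the derived column
toy["Company Age"] = REFERENCE_YEAR - toy["Year Found"]
print(toy)
```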

3. Model Training and Model Selection

Model Selection

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
import xgboost as xgb

In [None]:
# Separate features and target variable
# Keep numeric/boolean columns only: XGBoost cannot train on raw text columns
X = df2.drop('Sales (Domestic Ultimate Total USD)', axis=1).select_dtypes(include=['number', 'bool'])
y = df2['Sales (Domestic Ultimate Total USD)']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=306)

# Initialize the XGBoost regressor ("reg:linear" is deprecated in favour of "reg:squarederror")
model = xgb.XGBRegressor(objective="reg:squarederror", random_state=306)

Cross Validation

In [None]:
# Lists to store results
n_folds_values = list(range(4, 16))
mean_r2_scores = []
std_r2_scores = []

# Iterate over different numbers of folds
for n_folds in n_folds_values:
    # Use k-fold cross-validation with the current number of folds
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Perform cross-validation and get R-squared scores
    cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')

    # Append mean and standard deviation of R-squared scores to lists
    mean_r2_scores.append(cv_scores.mean())
    std_r2_scores.append(cv_scores.std())

# Plot the results
plt.errorbar(n_folds_values, mean_r2_scores, yerr=std_r2_scores, marker='o', linestyle='-', label='R-squared scores')
plt.xlabel('Number of Folds')
plt.ylabel('R-squared Score')
plt.title('Cross-Validated R-squared Scores for Different Numbers of Folds')
plt.legend()
plt.show()

4. Model evaluation

Performance Metrics -- Loss Functions (Adjusted R^2, MAE, MSE)
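
The three metrics named above can be computed directly from the true and predicted values. The following is a minimal NumPy sketch on toy numbers; the helper name `regression_metrics` is hypothetical. In this notebook it would presumably be called with `y_test`, `model.predict(X_test)` and `X_test.shape[1]` after fitting the model. scikit-learn's `mean_absolute_error`, `mean_squared_error` and `r2_score` cover the same ground, apart from the adjustment step.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Compute MAE, MSE, R^2 and adjusted R^2 for a regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)

    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Adjusted R^2 penalises the number of features p: needs n > p + 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "r2": r2, "adj_r2": adj_r2}

# Toy values for illustration only
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5], n_features=1)
print(metrics)
```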

## The cell below is **NOT** to be removed
##### The function is to be amended so that it accepts the given input (dataframe) and returns the required output (list).
##### It is recommended to test the function out prior to submission
-------------------------------------------------------------------------------------------------------------------------------
##### The hidden_data parsed into the function below will have the same layout columns wise as the dataset *SENT* to you
##### Thus, ensure that steps taken to modify the initial dataset to fit into the model are also carried out in the function below

In [None]:
def testing_hidden_data(hidden_data: pd.DataFrame) -> list:
    '''DO NOT REMOVE THIS FUNCTION.

    The function accepts a dataframe as input and returns an iterable (list)
    of predictions as output.

    The function should be coded to test on hidden data
    and should include any preprocessing functions needed for your model to perform.

    All relevant code MUST be included in this function.'''

    result = []
    return result

##### Cell to check testing_hidden_data function

In [None]:
# This cell should output a list of predictions.
test_df = pd.read_csv(filepath)
test_df = test_df.drop(columns=['Sales (Domestic Ultimate Total USD)'])
print(testing_hidden_data(test_df))

### Please have the filename renamed and ensure that it can be run with the requirements above being met. All the best!