# NBA Games Prediction Demo

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## Reading data

In [None]:
# Loading the datasets from the local CSV files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample = pd.read_csv("sample_submission.csv")

## Basic data exploration
A quick first look gives you a sense of the data's structure and contents before you dive deeper into it.

In [None]:
# Checking what the data looks like
train.head()

In [None]:
# More info about the data
train.info(verbose=True, show_counts=True)

In [None]:
# Shape of the datasets
print(train.shape)
print(test.shape)

## Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial initial phase in data analysis that involves examining and summarizing the main characteristics of a dataset. Its primary objective is to gain insights into the data, understand its underlying structure, identify patterns, and spot anomalies or relationships between variables.

### Key Aspects of Exploratory Data Analysis:
<u>Data Summarization</u>: EDA involves summarizing the main properties of the dataset using statistical measures like mean, median, standard deviation, etc., to get an overall understanding of the data.

<u>Visualization Techniques</u>: Graphical representations such as histograms, box plots, scatter plots, heatmaps, and others are used to visualize distributions, trends, correlations, and outliers in the data. Visualizations help in understanding the data's structure more intuitively.

<u>Handling Missing Values</u>: EDA identifies and deals with missing values or inconsistencies within the dataset through imputation or removal to ensure data integrity.

<u>Detecting Patterns and Relationships</u>: EDA examines the relationships between variables, seeking correlations, associations, or trends that can provide meaningful insights for further analysis.

<u>Outlier Detection</u>: Identification and assessment of outliers or anomalies in the data that might affect statistical analyses or modeling.

<u>Feature Engineering Considerations</u>: EDA may involve transforming or engineering features to improve predictive models or reduce dimensionality.

<u>Data Preprocessing Insights</u>: Understanding the need for data cleaning, normalization, scaling, or encoding before modeling.
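Several of the aspects above can be sketched in a few lines of pandas. The frame below is toy data made up for illustration (only the column names mirror this notebook), so treat it as a minimal sketch rather than a recipe for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: the values are invented, not taken from train.csv
toy = pd.DataFrame({
    "pts": [98, 110, 104, 121, np.nan, 95],
    "fg%": [0.44, 0.51, 0.47, 0.55, 0.46, 0.41],
    "won": [0, 1, 0, 1, 1, 0],
})

# Data summarization: count, mean, std, and quartiles for each numeric column
summary = toy.describe()

# Missing values: number of NaNs per column
missing = toy.isna().sum()

# Relationships: pairwise Pearson correlations between the columns
corr = toy.corr()

# Outlier detection with the common 1.5 * IQR rule, applied to 'pts'
q1, q3 = toy["pts"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["pts"] < q1 - 1.5 * iqr) | (toy["pts"] > q3 + 1.5 * iqr)]

print(summary, missing, corr, outliers, sep="\n\n")
```

The 1.5 * IQR rule is only one of several possible outlier heuristics; it is used here because it matches what a box plot draws as whiskers.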

### Other Resources for EDA
Please refer to the [Datathon Starter Pack]() for examples of how to conduct EDA and techniques that you can use.

In [None]:
# Distribution of points
sns.histplot(data=train, x='pts', kde=True)

In [None]:
# Distribution of fg%
sns.histplot(data=train, x='fg%', kde=True)

In [None]:
# Distribution of 3p%
sns.histplot(data=train, x='3p%', kde=True)

In [None]:
# Seeing the number of wins and losses of the Toronto Raptors
sns.countplot(data=train[train['team']=='TOR'], x='won')

In [None]:
# Looking at the relationship between points scored and the result of the game
sns.boxplot(data=train, x='won', y='pts')

## Data Cleaning/Data Pre-processing/Feature Engineering

Data scientists engage in data cleaning and feature engineering as essential steps before model creation for several crucial reasons:

1. <u>Ensuring Data Quality</u>:
Data cleaning involves handling missing values, correcting errors, and addressing inconsistencies. This process ensures the dataset's integrity and reliability, reducing biases or inaccuracies that could mislead model predictions.
2. <u>Enhancing Model Performance</u>:
Feature engineering involves transforming raw data into meaningful and predictive features. Creating new features or modifying existing ones helps the model better capture patterns and relationships within the data, leading to improved predictive performance.
3. <u>Mitigating Model Overfitting</u>:
Data cleaning and feature engineering help in reducing noise and irrelevant information in the dataset. By selecting or engineering relevant features, the model becomes less prone to overfitting, making it more generalizable to new data.
4. <u>Handling Dimensionality</u>:
Feature engineering can reduce dimensionality by selecting or creating a subset of the most informative features. This simplifies the model, making it computationally efficient while retaining essential information for accurate predictions.
5. <u>Adapting to Model Assumptions</u>:
Data cleaning ensures that the data meets assumptions required by certain models. For instance, the coefficients of linear models become unstable under multicollinearity, and feature engineering can address this by removing or combining highly correlated features.
6. <u>Improving Interpretability</u>:
Well-engineered features can enhance interpretability of the model's predictions. Creating features that align with domain knowledge allows stakeholders to comprehend and trust the model's outputs.
7. <u>Preparing Data for Various Models</u>:
Different models have diverse requirements. Data cleaning and feature engineering enable preparing the data to fit specific model algorithms, maximizing their effectiveness.

In summary, data cleaning and feature engineering lay the groundwork for accurate, robust, and reliable predictive models. These steps not only refine the dataset but also empower models to effectively learn patterns and relationships, enhancing their performance and applicability in solving real-world problems.
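As one concrete instance of the multicollinearity point in item 5, the sketch below flags feature pairs whose absolute correlation exceeds a threshold and drops one feature from each pair. The data is synthetic, and the column names (`fga`, `fg`, `ast`) and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features for illustration; names and distributions are hypothetical
rng = np.random.default_rng(0)
n = 200
fga = rng.normal(88, 5, n)                 # stand-in for shot attempts
fg = 0.46 * fga + rng.normal(0, 0.5, n)    # made shots track attempts very closely
ast = rng.normal(24, 4, n)                 # generated independently of the other two
features = pd.DataFrame({"fga": fga, "fg": fg, "ast": ast})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```

Which feature of a correlated pair to keep is a judgment call; this sketch simply keeps whichever column appears first.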

In [None]:
# Sort by date and reset the index
train = train.sort_values("date")
train = train.reset_index(drop=True)
test = test.sort_values("date")
test = test.reset_index(drop=True)

# Delete unnecessary columns in both train and test datasets (these are the "max" columns)
# NOTE: the slices are positional, and the second slice is applied after the first drop
train = train.drop(columns=train.columns[34:66])
train = train.drop(columns=train.columns[71:103])
train = train.drop(columns=["index_opp", "mp_opp"])

test = test.drop(columns=test.columns[34:66])
test = test.drop(columns=test.columns[71:103])
test = test.drop(columns=["index_opp", "mp_opp"])

# Converting the 'won' column from True/False to 1s and 0s
train['won'] = train['won']*1
test['won'] = test['won']*1

# Separating the date column into 3 columns: year, month, day
train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])

train['day'] = train['date'].dt.day
train['month'] = train['date'].dt.month
train['year'] = train['date'].dt.year
train = train.drop("date",axis=1)

test['day'] = test['date'].dt.day
test['month'] = test['date'].dt.month
test['year'] = test['date'].dt.year
test = test.drop("date",axis=1)

# Creating dummy variables for categorical variables
train = pd.get_dummies(train, columns=['team', 'team_opp'], drop_first=True)
test = pd.get_dummies(test, columns=['team', 'team_opp'], drop_first=True)
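One thing to watch when encoding train and test separately: `drop_first=True` drops the first category present in each frame, so the two frames can end up with different columns if a team is missing from one of them. Whether that happens here depends on the actual files. A minimal sketch of one way to guard against it, using toy frames with arbitrary team codes, is to give both frames the same explicit category list before encoding.

```python
import pandas as pd

# Toy frames for illustration; 'BOS' deliberately appears only in the training frame
toy_train = pd.DataFrame({"pts": [101, 99, 120], "team": ["BOS", "LAL", "TOR"]})
toy_test = pd.DataFrame({"pts": [95, 108], "team": ["LAL", "TOR"]})

# Build one shared, sorted category list from both frames
teams = sorted(set(toy_train["team"]) | set(toy_test["team"]))
for df in (toy_train, toy_test):
    df["team"] = pd.Categorical(df["team"], categories=teams)

# With a shared categorical dtype, get_dummies emits a column for every category,
# so drop_first removes the same baseline ('BOS') from both frames
train_d = pd.get_dummies(toy_train, columns=["team"], drop_first=True)
test_d = pd.get_dummies(toy_test, columns=["team"], drop_first=True)

print(list(train_d.columns))
print(list(test_d.columns))
```

Without the shared categories, the toy test frame would drop `LAL` as its baseline and lose the `team_LAL` column entirely.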

## Model Building
The model training phase is a core step in machine learning where a selected algorithm learns patterns and relationships within the provided data. This process involves:

<u>Data Input</u>: Using a portion of the prepared dataset known as the training set.

<u>Algorithm Application</u>: Employing the chosen machine learning algorithm to analyze the training data. The algorithm adjusts its internal parameters to recognize patterns that link input features to the desired output or target.

<u>Optimization</u>: Through repeated iterations, the algorithm fine-tunes its parameters to minimize the difference between predicted outcomes and actual results.

<u>Evaluation</u>: Assessing the model's performance using evaluation metrics specific to the problem type (accuracy, loss, etc.) to ensure its effectiveness in capturing patterns and making accurate predictions.

<u>Iterative Process</u>: Often requiring multiple iterations to adjust the model settings or parameters (such as learning rate or complexity) for improved performance.

The goal of model training is to enable the algorithm to learn from the data, allowing it to generalize well on new, unseen data, and make accurate predictions or classifications.
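The evaluation and iteration steps above can be sketched with cross-validation. This example uses synthetic data from `make_classification` as a stand-in for the prepared training set, and the candidate values for the regularization strength `C` are arbitrary assumptions; it is an illustration of the loop, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Iterative process: try several regularization strengths and compare them
candidates = [0.01, 0.1, 1.0, 10.0]
results = {}
for C in candidates:
    # Scaling inside the pipeline is refit on each training fold, so no
    # information from the validation fold leaks into the scaler
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="balanced_accuracy")
    results[C] = scores.mean()

best_C = max(results, key=results.get)
print(results)
print("Best C:", best_C)
```

Scaling the features first also tends to help logistic regression converge, which matters when convergence warnings are suppressed.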

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

In [None]:
# Splitting the train data into X and y
X = train.drop("target",axis=1)
y = train['target']

# Splitting the train data into another train and test set
# This is done to check that our model is not overfitting
# NOTE: This test set is different from our original test set because
#       it is derived from our original train set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [None]:
# Checking the metrics
print(f"The balanced accuracy score of the model is: {balanced_accuracy_score(y_test, predictions)}")
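For reference, balanced accuracy is the average of the recall obtained on each class, which makes it less flattering than plain accuracy when one class dominates. The sketch below computes it by hand on a small made-up label vector and compares the result with scikit-learn's implementation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels for illustration: four wins (1) and two losses (0)
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0])

# Recall per class: the fraction of each class's samples predicted correctly
# (here 1/2 for class 0 and 4/4 for class 1)
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
manual = np.mean(recalls)

print("Plain accuracy:   ", accuracy_score(y_true, y_pred))
print("Balanced (manual):", manual)
print("Balanced (sklearn):", balanced_accuracy_score(y_true, y_pred))
```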

## Predictions for Test Dataset

In [None]:
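# Training a new model on the full training data for the official test set
final_model = LogisticRegression()
final_model.fit(X,y)
# Select the training feature columns, in the same order, before predicting
final_predictions = final_model.predict(test[X.columns])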

# Creating a new dataframe in the required submission format
submission = pd.DataFrame(test['game_id'])
submission['target'] = final_predictions

submission

## Saving CSV file for Submission

In [None]:
submission.to_csv('submission.csv',index=False)