# Exploratory Data Analysis

This section covers the following steps:

## Step 0: Getting started  
- Import libraries  
- Import dataset  

## Step 1: Data visualisation  
- Exploring Scope  
- Choosing variables  
- Visualizing individual variables  
- Visualizing variable pairs  
- Visualizing as a whole
  
## Step 2: Data analysis
- Correlation coefficients
- Reverse-engineering rank

## Step 0: Getting Started

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the read_csv function from Pandas.  
Immediately after importing, take a quick look at the data using the head function.

In [None]:
userlist = pd.read_csv('DataSets/Cleaned data/outV1.csv')
userlist.head()

## Step 1.1: Exploring scope

We see that some columns are categorical, while others are numerical.  
Let us find out which values appear in the categorical columns, and what range of values the numerical ones cover.
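As a quick way to separate the two kinds of columns, the sketch below uses `select_dtypes` from Pandas. The small `toy` DataFrame and its values are hypothetical stand-ins for `userlist`, so treat this as an illustration rather than the actual layout of the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
toy = pd.DataFrame({
    "type": ["TV", "Movie", "TV"],
    "source": ["Manga", "Original", "Novel"],
    "episodes": [12, 1, 24],
    "score": [7.5, 8.1, 6.9],
})

# Columns stored as numbers versus everything else
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_columns)
print("Categorical:", categorical_columns)
```

A column that holds category codes as integers would be reported as numerical here, so the result is best read as a first guess and checked against `head()`.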

### Printing out possible values of categorical variables

We are interested in finding out how many types of values there are for the following columns
1. Type
2. Source
3. Status
4. Rating
5. Studio

In [None]:
# Categorical columns whose distinct values we want to inspect
categorical_columns = ['type', 'source', 'status', 'rating', 'studio']

# Print the unique values of each column in turn
for column in categorical_columns:
    print(f"Unique values for {column}:")
    for value in userlist[column].unique():
        print(value)
    print("\n")
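Knowing how often each value occurs can be as useful as knowing which values exist. A minimal sketch using `value_counts` follows; the `toy` DataFrame is hypothetical and only mimics the shape of the `type` column.

```python
import pandas as pd

# Hypothetical stand-in for one categorical column, including a missing entry
toy = pd.DataFrame({"type": ["TV", "Movie", "TV", "OVA", "TV", None]})

# Frequency of each value; dropna=False keeps missing entries visible
type_counts = toy["type"].value_counts(dropna=False)
print(type_counts)

# Number of distinct non-missing values
print("Distinct non-missing values:", toy["type"].nunique())
```

For a column with many distinct values, such as a studio column, the counts are usually easier to read than a raw list of unique values.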

### Printing out range of values for numerical variables

We are interested in finding out the minimum and maximum values for the following columns
1. Episodes
2. Duration
3. Score
4. Score_by
5. Rank
6. Popularity
7. Members
8. Favourites

In [None]:
# Numerical columns whose range we want to inspect
numerical_columns = ['episodes', 'duration', 'score', 'rank', 'popularity', 'members', 'favorites']

# Print the minimum and maximum of each column in turn
for column in numerical_columns:
    print(f"Minimum value for {column}:", userlist[column].min())
    print(f"Maximum value for {column}:", userlist[column].max())
    print("\n")

## Step 1.2: Choosing Initial Variables

We have decided to pick 5 variables to explore, and this section breaks down what they mean and how they are calculated


### Score

Score is a weighted average of the rating that users give to a particular anime  
It is calculated using the following formula:  

(v / (v + m)) * S + (m / (v + m)) * C  

Where
> v = Number of users giving a score  
> m = Minimum number of scored users required to get a calculated score  
> S = Average score  
> C = Mean score across the entire dataset  

We can see that S and C are essentially unweighted averages
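To make the weighting concrete, here is a small worked example of the formula. The function name `weighted_score` and all input values are hypothetical; they are chosen only to show how the result moves from C towards S as v grows.

```python
def weighted_score(v, m, S, C):
    """Weighted score: S and C blended according to the number of votes v."""
    return (v / (v + m)) * S + (m / (v + m)) * C

# Hypothetical title with an average score of 9.0, in a dataset whose mean is 6.5
few_votes = weighted_score(v=50, m=50, S=9.0, C=6.5)
many_votes = weighted_score(v=50000, m=50, S=9.0, C=6.5)

print("With few votes:", few_votes)
print("With many votes:", many_votes)
```

With v equal to m the two averages are weighted equally, and with v far larger than m the result is almost entirely S. The two weights always sum to 1.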

### Rank

It is not immediately clear how rank is calculated. Perhaps we can run a regression to reverse-engineer the formula...

### Popularity

This is the number of viewers that watched a particular anime

### Members

This is the number of people who added a particular anime to their list  
The list is for

### Favourites

This is the number of people who favourited a particular anime

## Step 1.3: Visualizing Individual Variables

Let us look in more detail at each chosen variable, using visual tools
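One possible approach is a histogram and a box plot for each variable. The sketch below uses randomly generated data as a stand-in, so the column names `score` and `members` and their distributions are assumptions; with the real data, `toy` would be replaced by the chosen columns of `userlist`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "score": rng.normal(6.5, 1.0, 500).clip(1, 10),
    "members": rng.lognormal(8, 1.5, 500),
})

# One row per variable: histogram on the left, box plot on the right
fig, axes = plt.subplots(len(toy.columns), 2, figsize=(12, 6))
for row, column in enumerate(toy.columns):
    axes[row, 0].hist(toy[column], bins=30)
    axes[row, 0].set_title(f"Histogram of {column}")
    axes[row, 1].boxplot(toy[column])
    axes[row, 1].set_title(f"Box plot of {column}")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would be needed. The histogram shows the shape of the distribution, while the box plot makes outliers easier to spot.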

## Step 1.4: Looking at chosen variables in pairs

Now that we know what the variables entail, let us compare them against one another

## Step 1.5: Comparing all chosen variables

Here we utilize a pair-plot to get the big picture

## Step 2.1: Correlation Coefficients

We look at correlation coefficients - specifically against Rank, in order to clue us into which variables to consider for reverse engineering its formula

In [None]:
# Calculating correlation coefficient between rank and score 
rank_vs_score_correlation_coefficient = userlist['rank'].corr(userlist['score'])
print(f"Correlation Coefficient between rank and score: {rank_vs_score_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and popularity 
rank_vs_popularity_correlation_coefficient = userlist['rank'].corr(userlist['popularity'])
print(f"Correlation Coefficient between rank and popularity: {rank_vs_popularity_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and members
rank_vs_members_correlation_coefficient = userlist['rank'].corr(userlist['members'])
print(f"Correlation Coefficient between rank and members: {rank_vs_members_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and favourites
rank_vs_favorites_correlation_coefficient = userlist['rank'].corr(userlist['favorites'])
print(f"Correlation Coefficient between rank and favorites: {rank_vs_favorites_correlation_coefficient}")
print("\n")
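The same coefficients can also be read off a single correlation matrix. This is a sketch on a tiny hypothetical DataFrame, assuming only that the columns of interest are numeric; the values are invented.

```python
import pandas as pd

# Hypothetical values, chosen so that rank falls as score rises
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "score": [9.1, 8.7, 8.2, 7.9, 7.0],
    "popularity": [3, 1, 10, 8, 20],
})

# Pearson correlation of every selected column with rank, excluding rank itself
columns_of_interest = ["rank", "score", "popularity"]
correlations = toy[columns_of_interest].corr()["rank"].drop("rank")
print(correlations.round(5))
```

Selecting the columns explicitly keeps categorical columns out of the calculation.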

### Evaluation

Remember how to interpret the values:  
> -1 indicates perfect negative correlation  
> 0 indicates no correlation  
> 1 indicates perfect positive correlation

With that in mind, these are the coefficients (rounded off to 5 d.p.):
* Rank and score: -0.66519
* Rank and popularity: 0.70756
* Rank and members: -0.38977
* Rank and favorites: -0.18767

From the results above, we can easily eliminate favourites as a variable, given how close its value is to 0  
Members is also up for consideration for elimination

## Step 2.2: Linear Regression

Let's try the simplest form of modelling to predict Rank as a start!

### Splitting the data

We must first split the data into train and test sets.  
Note that a seed value of 42 was chosen here for reproducibility.

In [None]:
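from sklearn.model_selection import train_test_split

# The split needs the predictors and response, so define them first
Predictor_3var = userlist[['score', 'popularity', 'members']]
Response = userlist['rank']

# Split the data into training and testing sets (80% training, 20% testing) with a seed of 42
Predictor_3var_train, Predictor_3var_test, Response_train, Response_test = train_test_split(Predictor_3var, Response, test_size=0.2, random_state=42)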
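To illustrate what the seed buys us, the sketch below splits a small made-up array twice with the same `random_state` and checks that both splits agree. The arrays `X` and `y` are hypothetical and unrelated to the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent splits using the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train rows:", len(X_train_a), "Test rows:", len(X_test_a))
print("Same test rows both times:", np.array_equal(y_test_a, y_test_b))
```

Because the rows chosen depend only on the seed and the number of rows, two models trained on different column subsets of the same data can still be evaluated on the same test rows.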

### 3-Variable Model

Since we aren't sure if Members should be included as a variable to predict Rank, let us include it first

In [None]:
# Predictor_3var contains the features (score, popularity, members) 
# Response contains the target variable (rank)
Predictor_3var = userlist[['score', 'popularity', 'members']]
Response = userlist['rank']

We create a linear regression model and train it using the train set

In [None]:
from sklearn.linear_model import LinearRegression

# Create and train the linear regression model
linear_reg_model_3var = LinearRegression()
linear_reg_model_3var.fit(Predictor_3var_train, Response_train)

After training, we use the model to predict rank on the test set  
We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Predict Rank for the test set
Rank_pred_linear_reg_3var = linear_reg_model_3var.predict(Predictor_3var_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_linear_reg_3var, color = "red")
plt.show()

To get a better evaluation of how effective the model is, we use the following metrics:
1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): A measure of the average magnitude of the errors in the predicted values
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model
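The four definitions above can be written out directly with NumPy, which may help when reading the scikit-learn results. The `actual` and `predicted` arrays are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.0, 2.0, 5.0])

errors = predicted - actual

mse = np.mean(errors ** 2)          # average squared error
rmse = np.sqrt(mse)                 # square root of the MSE, in the units of the target
mae = np.mean(np.abs(errors))       # average absolute error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_residual = np.sum(errors ** 2)
ss_total = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_residual / ss_total

print(mse, rmse, mae, r2)
```

Since RMSE and MAE are in the same units as the target, they can be read here as a typical error in rank positions.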

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_3var = mean_squared_error(Response_test, Rank_pred_linear_reg_3var)
print(f"Mean Squared Error: {mse_3var}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is not available in newer scikit-learn releases)
rmse_3var = np.sqrt(mse_3var)
print(f"Root Mean Squared Error: {rmse_3var}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_3var = mean_absolute_error(Response_test, Rank_pred_linear_reg_3var)
print(f"Mean Absolute Error: {mae_3var}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_3var = r2_score(Response_test, Rank_pred_linear_reg_3var)
print(f"R-squared Score: {r2_3var}")

### 2-Variable Model

Let us also try linear regression without members

In [None]:
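# Predictor_2var contains the features (score and popularity)
# Response contains the target variable (rank)
Predictor_2var = userlist[['score', 'popularity']]
Response = userlist['rank']

# Split with the same proportions and seed as before, so both models are tested on the same rows
Predictor_2var_train, Predictor_2var_test, Response_train, Response_test = train_test_split(Predictor_2var, Response, test_size=0.2, random_state=42)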

After splitting the data, we create a linear regression model and train it using the train set

In [None]:
from sklearn.linear_model import LinearRegression

# Create and train the linear regression model
linear_reg_model_2var = LinearRegression()
linear_reg_model_2var.fit(Predictor_2var_train, Response_train)

After training, we use the model to predict rank on the test set  
We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Predict Rank for the test set
Rank_pred_linear_reg_2var = linear_reg_model_2var.predict(Predictor_2var_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_linear_reg_2var, color = "red")
plt.show()

To get a better evaluation of how effective the model is, we use the following metrics:
1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): A measure of the average magnitude of the errors in the predicted values
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_2var = mean_squared_error(Response_test, Rank_pred_linear_reg_2var)
print(f"Mean Squared Error: {mse_2var}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is not available in newer scikit-learn releases)
rmse_2var = np.sqrt(mse_2var)
print(f"Root Mean Squared Error: {rmse_2var}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_2var = mean_absolute_error(Response_test, Rank_pred_linear_reg_2var)
print(f"Mean Absolute Error: {mae_2var}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_2var = r2_score(Response_test, Rank_pred_linear_reg_2var)
print(f"R-squared Score: {r2_2var}")

### Evaluation

Let's compare the 4 metrics from both models to see which one is more accurate

Remember how to read these metrics:  
> A lower MSE indicates better performance (0 is the lowest possible value)  
> A lower RMSE indicates better performance (0 is the lowest possible value)  
> A lower MAE indicates better performance (0 is the lowest possible value)  
> A higher R^2 indicates better performance (1 is the highest possible value)

Here are the results for the 3-var model (rounded off to 5 d.p.):
* Mean Squared Error: 6269914.25204
* Root Mean Squared Error: 2503.97968
* Mean Absolute Error: 1828.76560
* R-Squared Score: 0.62505

And here are the results for the 2-var model (rounded off to 5 d.p.):
* Mean Squared Error: 6307853.47160
* Root Mean Squared Error: 2511.54404
* Mean Absolute Error: 1827.69689
* R-Squared Score: 0.62278

We can see that they are extremely close, with R^2 scores differing by only about 0.0023. Although the MSE and RMSE rise slightly when Members is dropped, the MAE actually falls slightly.

Thus, we conclude that Members is indeed not a very helpful metric for determining Rank, so it can be safely removed
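To put the differences side by side, the reported metrics can be placed in a small table. The numbers below are copied from the results listed above; the column labels are our own choice.

```python
import pandas as pd

# Metrics as reported above for each model
results = pd.DataFrame(
    {
        "3-var": [6269914.25204, 2503.97968, 1828.76560, 0.62505],
        "2-var": [6307853.47160, 2511.54404, 1827.69689, 0.62278],
    },
    index=["MSE", "RMSE", "MAE", "R^2"],
)

# Absolute and relative change when Members is dropped
results["2-var minus 3-var"] = results["2-var"] - results["3-var"]
results["relative change (%)"] = 100 * results["2-var minus 3-var"] / results["3-var"]

print(results.round(5))
```

Viewed as relative changes, every metric moves by well under one percent.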

### Retrieving the formula

Now that we have our chosen model, let us retrieve the exact coefficients and intercept that it learned

In [None]:
# Retrieve the coefficients and intercept
coefficients_2var = linear_reg_model_2var.coef_
intercept_2var = linear_reg_model_2var.intercept_

# Construct the formula
formula = f"Rank = {intercept_2var:.2f} + "
for i, coef in enumerate(coefficients_2var):
    formula += f"({coef:.2f} * Predictor_{i+1}) + "

# Remove the trailing '+' and whitespace
formula = formula[:-3]

print("Formula:", formula)

We can thus see that:  
Rank = 10629.83 + (-1127.91 * Score) + (0.48 * Popularity)
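As a quick sanity check, the formula can be evaluated for a single title. The function name `predicted_rank` and the input values (a score of 8.0 and a popularity of 100) are hypothetical; the coefficients are the rounded ones shown above.

```python
def predicted_rank(score, popularity):
    """Apply the fitted 2-variable linear formula with the rounded coefficients."""
    return 10629.83 + (-1127.91 * score) + (0.48 * popularity)

# Hypothetical title: score of 8.0, popularity of 100
example = predicted_rank(score=8.0, popularity=100)
print(round(example, 2))
```

The negative coefficient on Score means that each additional point of score lowers the predicted rank number by roughly 1128 positions, which matches the negative correlation found earlier.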

## Step 2.3: Polynomial Regression

A polynomial regression is able to capture more complex patterns in the data. Maybe that will help us predict Rank better?

### Degree-2 Model

We will be re-using the previous train and test sets, Predictor_2var and Response, except they will now be converted into polynomial features

### Degree-3 Model

### Evaluation

### Retrieving the formula