# Exploratory Data Analysis

We want to achieve a few things in this section:
1. Find out what each column does, and what values it can take
2. List the columns we might find helpful to answer questions
3. Plot the distribution of interesting columns (variables)
4. Analyse relationships between these variables
5. From all of the above, eventually we should come up with a definition of "success" for each anime

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the read_csv function from Pandas.  
Immediately after importing, take a quick look at the data using the head function.

In [None]:
userlist = pd.read_csv('DataSets/Cleaned data/outV1.csv')
userlist.head()

### Printing out possible values of categorical variables

We are interested in finding out which values the following columns can take:
1. Type
2. Source
3. Status
4. Rating
5. Studio

In [None]:
# Filtering the Type column
type_unique_values = userlist['type'].unique()

# Print the unique values for Type
print("Unique values for Type:")
for value in type_unique_values:
    print(value)
print("\n")

# Filtering the Source column
source_unique_values = userlist['source'].unique()

# Print the unique values for Source
print("Unique values for source:")
for value in source_unique_values:
    print(value)
print("\n")

# Filtering the Status column
status_unique_values = userlist['status'].unique()

# Print the unique values for Status
print("Unique values for status:")
for value in status_unique_values:
    print(value)
print("\n")

# Filtering the Rating column
rating_unique_values = userlist['rating'].unique()

# Print the unique values for Rating
print("Unique values for rating:")
for value in rating_unique_values:
    print(value)
print("\n")

# Filtering the Studio column
studio_unique_values = userlist['studio'].unique()

# Print the unique values for Studio
print("Unique values for studio:")
for value in studio_unique_values:
    print(value)
print("\n")
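
The per-column listing above can also be written as a single loop. The following is a minimal sketch on a small made-up DataFrame (`toy_userlist` and its values are hypothetical stand-ins for `userlist`); it uses `value_counts`, which shows how often each category occurs in addition to listing it.

```python
import pandas as pd

# Hypothetical stand-in for userlist; the values are made up for illustration
toy_userlist = pd.DataFrame({
    "type": ["TV", "Movie", "TV", "OVA"],
    "source": ["Manga", "Original", "Manga", "Novel"],
})

# For each categorical column, count the rows that fall into each category
category_counts = {}
for column in ["type", "source"]:
    category_counts[column] = toy_userlist[column].value_counts(dropna=False)
    print(f"Unique values for {column}:")
    print(category_counts[column])
    print("\n")
```

Passing `dropna=False` keeps missing values visible as their own category, which may be useful when judging how clean a column is.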

### Printing out range of values for numerical variables

We are interested in finding out the minimum and maximum values for the following columns
1. Episodes
2. Duration
3. Score
4. Score_by
5. Rank
6. Popularity
7. Members
8. Favourites

In [None]:
# Filtering the Episodes column
episode_minimum_value = userlist['episodes'].min()
episode_maximum_value = userlist['episodes'].max()

print("Minimum value for episodes:", episode_minimum_value)
print("Maximum value for episodes:", episode_maximum_value)
print("\n")

# Filtering the Duration column
duration_minimum_value = userlist['duration'].min()
duration_maximum_value = userlist['duration'].max()

print("Minimum value for duration:", duration_minimum_value)
print("Maximum value for duration:", duration_maximum_value)
print("\n")

# Filtering the Score column
score_minimum_value = userlist['score'].min()
score_maximum_value = userlist['score'].max()

print("Minimum value for score:", score_minimum_value)
print("Maximum value for score:", score_maximum_value)
print("\n")

# Filtering the Rank column
rank_minimum_value = userlist['rank'].min()
rank_maximum_value = userlist['rank'].max()

print("Minimum value for rank:", rank_minimum_value)
print("Maximum value for rank:", rank_maximum_value)
print("\n")

# Filtering the Popularity column
popularity_minimum_value = userlist['popularity'].min()
popularity_maximum_value = userlist['popularity'].max()

print("Minimum value for popularity:", popularity_minimum_value)
print("Maximum value for popularity:", popularity_maximum_value)
print("\n")

# Filtering the Members column
member_minimum_value = userlist['members'].min()
member_maximum_value = userlist['members'].max()

print("Minimum value for members:", member_minimum_value)
print("Maximum value for members:", member_maximum_value)
print("\n")

# Filtering the Favourites column
favourite_minimum_value = userlist['favorites'].min()
favourite_maximum_value = userlist['favorites'].max()

print("Minimum value for favourites:", favourite_minimum_value)
print("Maximum value for favourites:", favourite_maximum_value)
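
As a more compact alternative, the minimum and maximum of several columns can be computed in one call with `DataFrame.agg`. This is only a sketch on made-up numbers (`toy_userlist` is a hypothetical stand-in for `userlist`).

```python
import pandas as pd

# Hypothetical stand-in for userlist; the numbers are made up for illustration
toy_userlist = pd.DataFrame({
    "episodes": [12, 24, 1, 50],
    "score": [7.5, 8.1, 6.9, 8.8],
})

# One row per statistic (min, max), one column per variable
value_ranges = toy_userlist[["episodes", "score"]].agg(["min", "max"])
print(value_ranges)
```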

### Primary Variables to Consider

# Score 
Score is a weighted average of the ratings that users give to a particular anime.
It is calculated using the following formula:

Weighted Score = (v / (v + m)) * S + (m / (v + m)) * C

Where:

S = Average score 
v = Number users giving a score
m = Minimum number of scored users required to get a calculated score
C = The mean score across the entire database

S and C are essentially the unweighted averages
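
To see what the weighting does, here is a small sketch that evaluates (v / (v + m)) * S + (m / (v + m)) * C for two hypothetical anime with the same average score S but very different numbers of votes v. The values m = 50 and C = 6.5 are made up for illustration and are not taken from the dataset.

```python
def weighted_score(S, v, m, C):
    """Weighted score: (v / (v + m)) * S + (m / (v + m)) * C."""
    return (v / (v + m)) * S + (m / (v + m)) * C

# Hypothetical inputs: both anime average 9.0, but one has far fewer votes
few_votes = weighted_score(S=9.0, v=10, m=50, C=6.5)
many_votes = weighted_score(S=9.0, v=10000, m=50, C=6.5)

print("Few votes: ", round(few_votes, 3))
print("Many votes:", round(many_votes, 3))
```

With few votes the result is pulled towards the database mean C, while with many votes it stays close to the anime's own average S.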

# Rank

It is not immediately clear how rank is calculated. Perhaps we can run a regression to reverse engineer the formula.

# Popularity

This is the number of viewers that watched a particular anime

# Members

This is the number of people who added a particular anime to their list

# Favourites

This is the number of people who favourited a particular anime


### Some reverse engineering

# Step 1
Let us now calculate the correlation coefficients between Rank and each of score, popularity, members and favourites, in order to eliminate the least relevant variables.

To interpret the values:
> −1 indicates a perfect negative correlation  
> 0 indicates no correlation  
> 1 indicates a perfect positive correlation

In [None]:
# Calculating correlation coefficient between rank and score 
rank_vs_score_correlation_coefficient = userlist['rank'].corr(userlist['score'])
print(f"Correlation Coefficient between rank and score: {rank_vs_score_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and popularity 
rank_vs_popularity_correlation_coefficient = userlist['rank'].corr(userlist['popularity'])
print(f"Correlation Coefficient between rank and popularity: {rank_vs_popularity_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and members
rank_vs_members_correlation_coefficient = userlist['rank'].corr(userlist['members'])
print(f"Correlation Coefficient between rank and members: {rank_vs_members_correlation_coefficient}")
print("\n")

# Calculating correlation coefficient between rank and favourites
rank_vs_favorites_correlation_coefficient = userlist['rank'].corr(userlist['favorites'])
print(f"Correlation Coefficient between rank and favorites: {rank_vs_favorites_correlation_coefficient}")
print("\n")
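
The four coefficients can also be obtained in a single call, since `DataFrame.corr` returns the full Pearson correlation matrix. This is a minimal sketch on made-up numbers (`toy_userlist` is a hypothetical stand-in for `userlist`).

```python
import pandas as pd

# Hypothetical stand-in for userlist; the numbers are made up for illustration
toy_userlist = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "score": [9.1, 8.7, 8.8, 8.0, 7.5],
    "members": [900, 1000, 400, 450, 100],
})

# Take the rank column of the correlation matrix and drop rank's correlation with itself
rank_correlations = toy_userlist.corr()["rank"].drop("rank")
print(rank_correlations)
```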

Results:
Correlation Coefficient between rank and score: -0.6651938263432281
Correlation Coefficient between rank and popularity: 0.70755999120178
Correlation Coefficient between rank and members: -0.389765535770357
Correlation Coefficient between rank and favorites: -0.18767204

From the results above, we can easily eliminate favourites as a variable, given how close its value is to 0.
Members is also up for consideration for elimination.


# Step 2.1: A first shot at linear regression

We will now attempt to use a linear regression model to work out the formula for Rank, using score, popularity and members.
To do that, we must first split the data into train and test sets

Take note that a seed value of 42 was chosen here for reproducibility

In [None]:
from sklearn.model_selection import train_test_split

# Predictor_3var contains the features (score, popularity, members) 
# Response contains the target variable (rank)
Predictor_3var = userlist[['score', 'popularity', 'members']]
Response = userlist['rank']

# Split the data into training and testing sets (80% training, 20% testing) with a seed of 42
Predictor_3var_train, Predictor_3var_test, Response_train, Response_test = train_test_split(Predictor_3var, Response, test_size=0.2, random_state=42)

After splitting the data, we create a linear regression model and train it using the train set

In [None]:
from sklearn.linear_model import LinearRegression

# Create and train the linear regression model
linear_reg_model_3var = LinearRegression()
linear_reg_model_3var.fit(Predictor_3var_train, Response_train)

After training, we use the model to predict rank on the test set

We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Predict Rank for the test set
Rank_pred_linear_reg_3var = linear_reg_model_3var.predict(Predictor_3var_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_linear_reg_3var, color = "red")
plt.show()

To get a better evaluation of how effective the model is, we use the following metrics:

1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): The square root of the MSE, which expresses the typical size of the errors in the same units as Rank
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model

To interpret these metrics:
> A lower MSE indicates better performance (0 lowest)
> A lower RMSE indicates better performance (0 lowest)
> A lower MAE indicates better performance (0 lowest)
> A higher R^2 indicates better performance (1 highest)

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_3var = mean_squared_error(Response_test, Rank_pred_linear_reg_3var)
print(f"Mean Squared Error: {mse_3var}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is no longer available in recent scikit-learn releases)
rmse_3var = np.sqrt(mse_3var)
print(f"Root Mean Squared Error: {rmse_3var}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_3var = mean_absolute_error(Response_test, Rank_pred_linear_reg_3var)
print(f"Mean Absolute Error: {mae_3var}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_3var = r2_score(Response_test, Rank_pred_linear_reg_3var)
print(f"R-squared Score: {r2_3var}")
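
Since the same four metrics are computed for every model in this section, they could be wrapped in a small helper. The function name `regression_metrics` and the example numbers below are hypothetical; this is a sketch, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R^2 for one set of predictions."""
    mse = mean_squared_error(actual, predicted)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # square root of the MSE
        "MAE": mean_absolute_error(actual, predicted),
        "R2": r2_score(actual, predicted),
    }

# Made-up ranks and predictions, for illustration only
example_metrics = regression_metrics([1, 2, 3, 4], [1.5, 2.0, 2.5, 4.0])
for name, value in example_metrics.items():
    print(f"{name}: {value}")
```

With such a helper, each model would need only one call, for example `regression_metrics(Response_test, Rank_pred_linear_reg_3var)`.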

# Step 2.2: Another crack at linear regression

We will now attempt to use a linear regression model to work out the formula for Rank, using only score and popularity.
To do that, we must first split the data into train and test sets

Take note that a seed value of 42 was chosen here for reproducibility

In [None]:
from sklearn.model_selection import train_test_split

# Predictor contains the features (score and popularity) 
# Response contains the target variable (rank)
Predictor_2var = userlist[['score', 'popularity']]
Response = userlist['rank']

# Split the data into training and testing sets (80% training, 20% testing) with a seed of 42
Predictor_2var_train, Predictor_2var_test, Response_train, Response_test = train_test_split(Predictor_2var, Response, test_size=0.2, random_state=42)

After splitting the data, we create a linear regression model and train it using the train set

In [None]:
from sklearn.linear_model import LinearRegression

# Create and train the linear regression model
linear_reg_model_2var = LinearRegression()
linear_reg_model_2var.fit(Predictor_2var_train, Response_train)

After training, we use the model to predict rank on the test set

We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Predict Rank for the test set
Rank_pred_linear_reg_2var = linear_reg_model_2var.predict(Predictor_2var_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_linear_reg_2var, color = "red")
plt.show()

To get a better evaluation of how effective the model is, we use the following metrics:

1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): The square root of the MSE, which expresses the typical size of the errors in the same units as Rank
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model

To interpret these metrics:
> A lower MSE indicates better performance (0 lowest)
> A lower RMSE indicates better performance (0 lowest)
> A lower MAE indicates better performance (0 lowest)
> A higher R^2 indicates better performance (1 highest)

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_2var = mean_squared_error(Response_test, Rank_pred_linear_reg_2var)
print(f"Mean Squared Error: {mse_2var}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is no longer available in recent scikit-learn releases)
rmse_2var = np.sqrt(mse_2var)
print(f"Root Mean Squared Error: {rmse_2var}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_2var = mean_absolute_error(Response_test, Rank_pred_linear_reg_2var)
print(f"Mean Absolute Error: {mae_2var}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_2var = r2_score(Response_test, Rank_pred_linear_reg_2var)
print(f"R-squared Score: {r2_2var}")

# Evaluating the two models

Here are the results for the 3 variable linear regression model:
Mean Squared Error: 6269914.252040876
Root Mean Squared Error: 2503.979682833085
Mean Absolute Error: 1828.7656030377157
R-squared Score: 0.6250455051218821

And here are the results for the 2 variable model:
Mean Squared Error: 6307853.471599364
Root Mean Squared Error: 2511.5440413417728
Mean Absolute Error: 1827.6968888255014
R-squared Score: 0.6227766573619637

We can see that they are extremely close, with R^2 scores only differing by around 0.0022. This shows that Members is indeed not a very helpful metric for determining Rank, so it can be safely removed

# Retrieving the formula

Now that we have our chosen model, let us retrieve the exact coefficients and intercepts that were used to generate it

In [None]:
# Retrieve the coefficients and intercept
coefficients_2var = linear_reg_model_2var.coef_
intercept_2var = linear_reg_model_2var.intercept_

# Construct the formula
formula = f"Rank = {intercept_2var:.2f} + "
for i, coef in enumerate(coefficients_2var):
    formula += f"({coef:.2f} * Predictor_{i+1}) + "

# Remove the trailing '+' and whitespace
formula = formula[:-3]

print("Formula:", formula)

We can thus see that:
Rank = 10629.83 + (-1127.91 * Score) + (0.48 * Popularity)
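
As a quick worked example, the formula can be evaluated by hand for a hypothetical anime. The coefficients below are the rounded values from the formula above; the inputs (a score of 8.0 and a popularity of 100) are made up.

```python
# Rounded coefficients copied from the fitted 2 variable formula
INTERCEPT = 10629.83
SCORE_COEF = -1127.91
POPULARITY_COEF = 0.48

def predicted_rank(score, popularity):
    """Evaluate Rank = intercept + (score coef * Score) + (popularity coef * Popularity)."""
    return INTERCEPT + SCORE_COEF * score + POPULARITY_COEF * popularity

# Hypothetical anime: 10629.83 - 9023.28 + 48.00
example_rank = predicted_rank(score=8.0, popularity=100)
print("Predicted rank:", round(example_rank, 2))

# A very high score pushes the prediction below 1, which is not a valid rank
extreme_rank = predicted_rank(score=10.0, popularity=1)
print("Predicted rank for an extreme case:", round(extreme_rank, 2))
```

The second call illustrates one limitation of a purely linear formula: it is not constrained to the range of valid ranks.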

# Step 3.1: Let's try polynomial regression

A polynomial regression is able to capture more complex patterns in the data. Maybe that will help us predict Rank better?

We will be re-using the previous train and test sets, Predictor_2var and Response, except they will now be converted into polynomial features

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Set the degree of the polynomial
degree = 2  

# Conversion of Predictor_2var_train into polynomial feature
poly2_features = PolynomialFeatures(degree=degree)
Predictor_2var_poly2 = poly2_features.fit_transform(Predictor_2var_train)

# Conversion of Predictor_2var_test into polynomial feature
Predictor_2var_test_poly2 = poly2_features.transform(Predictor_2var_test)

We can then train the model using said polynomial features, and use it to make predictions on the test set for Rank

In [None]:
# Training the model
poly2_model = LinearRegression()
poly2_model.fit(Predictor_2var_poly2, Response_train)

# Predict Rank for test set
Rank_pred_poly2_model = poly2_model.predict(Predictor_2var_test_poly2)

We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Plotting the predictions against the actual ranks
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_poly2_model, color='red', label='Polynomial Regression')
plt.xlabel("Actual Rank")
plt.ylabel("Predicted Rank")
plt.legend()
plt.show()

Now that we have the model, let us use the same metrics to calculate how good it is at predicting Rank:

1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): The square root of the MSE, which expresses the typical size of the errors in the same units as Rank
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_poly2 = mean_squared_error(Response_test, Rank_pred_poly2_model)
print(f"Mean Squared Error: {mse_poly2}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is no longer available in recent scikit-learn releases)
rmse_poly2 = np.sqrt(mse_poly2)
print(f"Root Mean Squared Error: {rmse_poly2}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_poly2 = mean_absolute_error(Response_test, Rank_pred_poly2_model)
print(f"Mean Absolute Error: {mae_poly2}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_poly2 = r2_score(Response_test, Rank_pred_poly2_model)
print(f"R-squared Score: {r2_poly2}")

The results are in!

Remember, this is what we got for the 2 variable linear regression:
Mean Squared Error: 6307853.471599364
Root Mean Squared Error: 2511.5440413417728
Mean Absolute Error: 1827.6968888255014
R-squared Score: 0.6227766573619637

And this is what we get with polynomial regression, with degree 2:
Mean Squared Error: 5319071.671615491
Root Mean Squared Error: 2306.3112694550773
Mean Absolute Error: 1590.2145713248253
R-squared Score: 0.6819079573214

We can see that the results have improved somewhat, with R^2 jumping up by around 0.06 points.

# Step 3.2: Let's try polynomial regression with a higher degree

A higher degree is able to capture even more complex patterns. Maybe that will help us predict Rank even better?

We will be re-using the previous train and test sets, Predictor_2var and Response, except they will now be converted into polynomial features

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Set the degree of the polynomial
degree = 3  

# Conversion of Predictor_2var_train into polynomial feature
poly3_features = PolynomialFeatures(degree=degree)
Predictor_2var_poly3 = poly3_features.fit_transform(Predictor_2var_train)

# Conversion of Predictor_2var_test into polynomial feature
Predictor_2var_test_poly3 = poly3_features.transform(Predictor_2var_test)

We can then train the model using said polynomial features, and use it to make predictions on the test set for Rank

In [None]:
# Training the model
poly3_model = LinearRegression()
poly3_model.fit(Predictor_2var_poly3, Response_train)

# Predict Rank for test set
Rank_pred_poly3_model = poly3_model.predict(Predictor_2var_test_poly3)

We then plot the resulting predictions to check if anything looks unusual

In [None]:
# Plotting the predictions against the actual ranks
f = plt.figure(figsize=(16, 8))
plt.scatter(Response_test, Rank_pred_poly3_model, color='red', label='Polynomial Regression')
plt.xlabel("Actual Rank")
plt.ylabel("Predicted Rank")
plt.legend()
plt.show()

Now that we have the model, let us use the same metrics to calculate how good it is at predicting Rank:

1. Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
2. Root Mean Squared Error (RMSE): The square root of the MSE, which expresses the typical size of the errors in the same units as Rank
3. Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
4. Coefficient of Determination (R^2): The proportion of the variance in the target variable that is explained by the model

In [None]:
# Calculating mean squared error
from sklearn.metrics import mean_squared_error
mse_poly3 = mean_squared_error(Response_test, Rank_pred_poly3_model)
print(f"Mean Squared Error: {mse_poly3}")

# Calculating root mean squared error as the square root of the MSE
# (the squared=False argument is no longer available in recent scikit-learn releases)
rmse_poly3 = np.sqrt(mse_poly3)
print(f"Root Mean Squared Error: {rmse_poly3}")

# Calculating mean absolute error
from sklearn.metrics import mean_absolute_error
mae_poly3 = mean_absolute_error(Response_test, Rank_pred_poly3_model)
print(f"Mean Absolute Error: {mae_poly3}")

# Calculating coefficient of determination
from sklearn.metrics import r2_score
r2_poly3 = r2_score(Response_test, Rank_pred_poly3_model)
print(f"R-squared Score: {r2_poly3}")

The results are in!

Remember, this is what we got for the 2 variable linear regression:
Mean Squared Error: 6307853.471599364
Root Mean Squared Error: 2511.5440413417728
Mean Absolute Error: 1827.6968888255014
R-squared Score: 0.6227766573619637

And this is what we get with polynomial regression, with degree 3:
Mean Squared Error: 4852202.752170597
Root Mean Squared Error: 2202.771606901314
Mean Absolute Error: 1550.076935413968
R-squared Score: 0.7098277330676199214

We can see that the results have improved, albeit marginally. The improvement in R^2 over degree 2 is still larger than the one gained by going from 2 variable linear regression to 3 variable linear regression (around 0.028 compared to 0.0022), so it is worthwhile to keep the degree at 3. Any higher, though, and the model might become susceptible to overfitting.
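
One way to check the overfitting concern is to compare R^2 on the train and test sets as the degree grows. The sketch below does this on synthetic data (a noisy cubic curve, not the anime dataset), so the numbers it prints say nothing about Rank; all names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration only: a cubic trend plus random noise
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 3 - 2 * x[:, 0] + rng.normal(scale=2.0, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit one polynomial model per degree and record (train R^2, test R^2)
r2_by_degree = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)
    r2_by_degree[degree] = (
        r2_score(y_train, model.predict(x_train)),
        r2_score(y_test, model.predict(x_test)),
    )
    print(f"degree {degree}: train R^2 = {r2_by_degree[degree][0]:.4f}, "
          f"test R^2 = {r2_by_degree[degree][1]:.4f}")
```

The train R^2 can only go up as the degree increases, so a test R^2 that stalls or drops while the train R^2 keeps rising is the usual sign of overfitting.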

# Retrieving the formula

Now that we have our polynomial model, let us also retrieve the exact coefficients and intercept that were used to generate it

In [None]:
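# Extract the coefficients and intercept of the degree 3 polynomial model
coefficients_poly3 = poly3_model.coef_
intercept_poly3 = poly3_model.intercept_

# Each coefficient belongs to one generated term (e.g. score^2 or score popularity),
# not to a power of a single x, so label it with the term's name
term_names_poly3 = poly3_features.get_feature_names_out()

# Construct the formula
formula = f'Rank = {intercept_poly3:.2f}'
for name, coef in zip(term_names_poly3, coefficients_poly3):
    formula += f' + {coef:.2f} * ({name})'

print("Formula:", formula)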

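We can thus see the formula below. The coefficients come in the order in which PolynomialFeatures generates its terms (1, Score, Popularity, Score^2, Score * Popularity, Popularity^2, Score^3, Score^2 * Popularity, Score * Popularity^2, Popularity^3), so each one is matched to its term rather than to a power of a single x. Terms absent from the printed result are left out:
Rank = 5009.94 + (2224.53 * Score) + (3.10 * Popularity) + (-316.36 * Score^2) + (-0.50 * Score * Popularity) + (-4.92 * Score^3) + (0.02 * Score^2 * Popularity)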