## 2021-2022 Football Player Stats

1. **Data Preprocessing:** This involves loading each dataset, handling missing values, encoding categorical variables, and splitting the data into features (X) and target (y).

2. **Model Training and Parameter Tuning:**

   - For each model (Decision Tree, Random Forest, CatBoost, Naive Bayes, Logistic Regression, XGBoost, KNN), we'll train the model on the training data.
   - We'll use cross-validation (at least 5-fold) to tune the model parameters.

3. **Model Evaluation:** After training, we'll evaluate each model using accuracy and F1-score.

4. **Output the Results:** Provide a summary of the performance of each model on each dataset.
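
Step 2 refers to tuning parameters with cross-validation. A minimal sketch of how that could look with scikit-learn's `GridSearchCV` follows. It assumes a synthetic stand-in dataset (`X_demo`, `y_demo`) and a hypothetical parameter grid; neither is taken from this notebook, and the grid values are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the real training split when available
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical grid; the values are illustrative, not tuned for any dataset here
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

# 5-fold cross-validated search, scored with weighted F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1_weighted",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern should carry over to the other classifiers by swapping the estimator and the grid.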

We'll use the 'Pos' (position) column as the target variable for classification. The next steps in data preprocessing are:

1. **Handle Missing Values:** Check for and handle any missing values in the dataset.
2. **Encode Categorical Variables:** Convert categorical variables into a format suitable for machine learning models.
3. **Feature Selection:** Select relevant features for the classification task, excluding the target variable and other non-informative columns like player names.
4. **Data Splitting:** Split the data into features (X) and target (y), and then into training and testing sets.

In [45]:
import pandas as pd

football_data = pd.read_csv('2021-2022 Football Player Stats.csv', encoding='ISO-8859-1', delimiter=';')

# Display the first few rows and summary information
football_data.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,Off,Crs,TklW,PKwon,PKcon,OG,Recov,AerWon,AerLost,AerWon%
0,1,Max Aarons,ENG,DF,Norwich City,Premier League,22.0,2000,34,32,...,0.03,1.41,1.16,0.0,0.06,0.03,5.53,0.47,1.59,22.7
1,2,Yunis Abdelhamid,MAR,DF,Reims,Ligue 1,34.0,1987,34,34,...,0.0,0.06,1.39,0.0,0.03,0.0,6.77,2.02,1.36,59.8
2,3,Salis Abdul Samed,GHA,MF,Clermont Foot,Ligue 1,22.0,2000,31,29,...,0.0,0.36,1.24,0.0,0.0,0.0,8.76,0.88,0.88,50.0
3,4,Laurent Abergel,FRA,MF,Lorient,Ligue 1,29.0,1993,34,34,...,0.03,0.79,2.23,0.0,0.0,0.0,8.87,0.43,0.43,50.0
4,5,Charles Abi,FRA,FW,Saint-Étienne,Ligue 1,22.0,2000,1,1,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,2.0,0.0,100.0


In [46]:
football_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2921 entries, 0 to 2920
Columns: 143 entries, Rk to AerWon%
dtypes: float64(133), int64(5), object(5)
memory usage: 3.2+ MB


In [47]:
football_data.describe()

Unnamed: 0,Rk,Age,Born,MP,Starts,Min,90s,Goals,Shots,SoT,...,Off,Crs,TklW,PKwon,PKcon,OG,Recov,AerWon,AerLost,AerWon%
count,2921.0,2920.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,...,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0,2921.0
mean,1461.0,26.092123,1994.725094,18.800068,13.749743,1234.756248,13.719069,0.111274,1.220431,0.391462,...,0.195029,1.063708,1.026689,0.01519,0.01305,0.003451,7.410294,1.664286,1.858305,40.485861
std,843.364393,4.641746,37.210426,11.619882,11.393763,977.941288,10.865255,0.233688,1.511266,0.784754,...,0.509935,1.341177,1.028466,0.21167,0.052864,0.020306,3.650554,1.938046,2.245096,24.29729
min,1.0,16.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,731.0,23.0,1992.0,8.0,3.0,307.0,3.4,0.0,0.28,0.0,...,0.0,0.0,0.43,0.0,0.0,0.0,5.07,0.52,0.84,26.2
50%,1461.0,26.0,1996.0,20.0,12.0,1102.0,12.2,0.0,0.82,0.19,...,0.0,0.59,0.92,0.0,0.0,0.0,7.38,1.23,1.39,43.5
75%,2191.0,29.0,1999.0,29.0,23.0,2025.0,22.5,0.15,1.83,0.56,...,0.2,1.7,1.41,0.0,0.0,0.0,9.38,2.27,2.21,57.1
max,2921.0,41.0,2006.0,38.0,38.0,3420.0,38.0,5.0,20.0,20.0,...,10.0,15.0,10.0,10.0,1.43,0.5,40.0,30.0,40.0,100.0


In [48]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Handling missing values
# For simplicity, we'll fill numeric columns with the median and categorical with the mode
football_data_filled = football_data.fillna(football_data.median(numeric_only=True))
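for column in football_data_filled.select_dtypes(include=['object']):
    # Assign the result back rather than calling fillna(inplace=True) on a column selection,
    # which newer pandas versions warn about
    football_data_filled[column] = football_data_filled[column].fillna(football_data_filled[column].mode()[0])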

# Encoding categorical variables
le = LabelEncoder()
for column in football_data_filled.select_dtypes(include=['object']):
    football_data_filled[column] = le.fit_transform(football_data_filled[column])

# Feature Selection
# Removing non-informative columns like 'Player', 'Nation', 'Squad', 'Comp', 'Rk'
features = football_data_filled.drop(columns=['Player', 'Nation', 'Squad', 'Comp', 'Rk', 'Pos'])
target = football_data_filled['Pos']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((2336, 137), (585, 137), (2336,), (585,))

The data preprocessing for the "2021-2022 Football Player Stats" dataset is complete. We've handled missing values, encoded categorical variables, performed feature selection (excluding non-informative columns like 'Player', 'Nation', 'Squad', 'Comp', 'Rk'), and split the data into training and testing sets.

The dataset is now divided as follows:

- Training Features: 2336 samples, 137 features
- Testing Features: 585 samples, 137 features
- Training Targets: 2336 samples
- Testing Targets: 585 samples

# **Decision Tree Classifier**

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Define scorers for cross-validation
accuracy_scorer = make_scorer(accuracy_score)
f1_scorer = make_scorer(f1_score, average='weighted')

# Perform 5-fold cross-validation for accuracy and F1-score
dt_accuracy_scores = cross_val_score(dt_classifier, X_train, y_train, cv=5, scoring=accuracy_scorer)
dt_f1_scores = cross_val_score(dt_classifier, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
dt_avg_accuracy = np.mean(dt_accuracy_scores)
dt_avg_f1_score = np.mean(dt_f1_scores)

dt_avg_accuracy, dt_avg_f1_score



(0.6429738831237761, 0.6452571820720376)

The Decision Tree model's performance on the "2021-2022 Football Player Stats" dataset, evaluated using 5-fold cross-validation, is as follows:

- Average Accuracy: 64.30%
- Average F1-Score: 64.53%
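
The cell above calls `cross_val_score` twice, which fits the model on every fold once per metric. As a possible alternative, scikit-learn's `cross_validate` can collect several metrics from a single set of fits. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo`) rather than the football features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); not the football dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# One 5-fold run that records both metrics for each fold
cv_results = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=5,
    scoring={"accuracy": "accuracy", "f1_weighted": "f1_weighted"},
)

avg_accuracy = np.mean(cv_results["test_accuracy"])
avg_f1 = np.mean(cv_results["test_f1_weighted"])
print(avg_accuracy, avg_f1)
```

The built-in scorer names `"accuracy"` and `"f1_weighted"` correspond to the `accuracy_scorer` and `f1_scorer` objects defined with `make_scorer` above.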


# **Random Forest Classifier**

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation for accuracy and F1-score
rf_accuracy_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5, scoring=accuracy_scorer)
rf_f1_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
rf_avg_accuracy = np.mean(rf_accuracy_scores)
rf_avg_f1_score = np.mean(rf_f1_scores)

rf_avg_accuracy, rf_avg_f1_score





(0.7410045937883197, 0.7084425364465041)

The Random Forest model's performance on the "2021-2022 Football Player Stats" dataset, evaluated using 5-fold cross-validation, is as follows:

- Average Accuracy: 74.10%
- Average F1-Score: 70.84%

# **Gaussian Naive Bayes Classifier**

In [7]:
from sklearn.naive_bayes import GaussianNB

# Initialize Gaussian Naive Bayes Classifier
nb_classifier = GaussianNB()

# Perform 5-fold cross-validation for accuracy and F1-score
nb_accuracy_scores = cross_val_score(nb_classifier, X_train, y_train, cv=5, scoring=accuracy_scorer)
nb_f1_scores = cross_val_score(nb_classifier, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
nb_avg_accuracy = np.mean(nb_accuracy_scores)
nb_avg_f1_score = np.mean(nb_f1_scores)

nb_avg_accuracy, nb_avg_f1_score





(0.45975585204707264, 0.5035049736554909)

In [8]:
!pip install xgboost
!pip install catboost

Installing collected packages: catboost
Successfully installed catboost-1.2.2


# **XGBoost Classifier**

In [None]:
from xgboost import XGBClassifier


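# Re-encode the target so the class labels are consecutive integers starting at 0, which XGBoost expects
le_y_train = LabelEncoder()
y_train_encoded = le_y_train.fit_transform(y_train)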

# Initialize XGBoost Classifier
xgb_classifier = XGBClassifier(random_state=42)

# Perform 5-fold cross-validation
xgb_accuracy_scores = cross_val_score(xgb_classifier, X_train, y_train_encoded, cv=5, scoring=accuracy_scorer)
xgb_f1_scores = cross_val_score(xgb_classifier, X_train, y_train_encoded, cv=5, scoring=f1_scorer)

# Calculate average scores
xgb_avg_accuracy = np.mean(xgb_accuracy_scores)
xgb_avg_f1_score = np.mean(xgb_f1_scores)

xgb_avg_accuracy, xgb_avg_f1_score



# **CatBoost Classifier**

In [10]:
from catboost import CatBoostClassifier

# Initialize CatBoost Classifier
catboost_classifier = CatBoostClassifier(random_state=42, verbose=0)  # verbose=0 to avoid lengthy outputs

# Perform 5-fold cross-validation
catboost_accuracy_scores = cross_val_score(catboost_classifier, X_train, y_train, cv=5, scoring=accuracy_scorer)
catboost_f1_scores = cross_val_score(catboost_classifier, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
catboost_avg_accuracy = np.mean(catboost_accuracy_scores)
catboost_avg_f1_score = np.mean(catboost_f1_scores)

catboost_avg_accuracy, catboost_avg_f1_score




(0.7521404125258515, 0.7259391707790144)

# **KNN Classifier**

In [11]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize KNN Classifier
knn_classifier = KNeighborsClassifier()

# Perform 5-fold cross-validation
knn_accuracy_scores = cross_val_score(knn_classifier, X_train, y_train, cv=5, scoring=accuracy_scorer)
knn_f1_scores = cross_val_score(knn_classifier, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
knn_avg_accuracy = np.mean(knn_accuracy_scores)
knn_avg_f1_score = np.mean(knn_f1_scores)

knn_avg_accuracy, knn_avg_f1_score




(0.5492239975109354, 0.5130279240904185)

# **Logistic Regression**

In [12]:
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)  # Increased max_iter for convergence

# Perform 5-fold cross-validation
log_reg_accuracy_scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring=accuracy_scorer)
log_reg_f1_scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring=f1_scorer)

# Calculate average scores
log_reg_avg_accuracy = np.mean(log_reg_accuracy_scores)
log_reg_avg_f1_score = np.mean(log_reg_f1_scores)

log_reg_avg_accuracy, log_reg_avg_f1_score


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(0.7362982485038161, 0.713971525755135)
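
The output above includes a convergence warning that suggests scaling the data. One way to try that, sketched here under the assumption of synthetic stand-in data (`X_demo`, `y_demo`), is to put a `StandardScaler` in front of the classifier with `make_pipeline`, so the scaler is fit only on the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); not the football dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Standardize features, then fit logistic regression, inside one estimator
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)

# Cross-validation refits the whole pipeline per fold, so no test-fold statistics leak into scaling
scaled_scores = cross_val_score(scaled_log_reg, X_demo, y_demo, cv=5, scoring="accuracy")
print(scaled_scores.mean())
```

Distance-based models such as KNN are also sensitive to feature scale, so the same wrapper may be worth trying there. Whether it changes the scores on these datasets is not something this notebook measures.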

# **Top 1000 movies from IMDb**

The dataset contains the following columns:

1. `Poster_Link`: URL link to the movie poster.
2. `Series_Title`: Title of the movie.
3. `Released_Year`: Year the movie was released.
4. `Certificate`: Certification of the movie.
5. `Runtime`: Duration of the movie.
6. `Genre`: Genre(s) of the movie.
7. `IMDB_Rating`: IMDb rating of the movie.
8. `Overview`: Brief overview of the movie plot.
9. `Meta_score`: Metascore rating of the movie.
10. `Director`: Director of the movie.
11. `Star1`, `Star2`, `Star3`, `Star4`: Stars of the movie.
12. `No_of_Votes`: Number of votes the movie received on IMDb.
13. `Gross`: Gross revenue of the movie.





In [13]:
import pandas as pd

file_path = 'imdb_top_1000.csv'

# Reading the file
imdb_data = pd.read_csv(file_path)
imdb_data.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [16]:
# Data Preprocessing for the IMDb dataset

# Handling missing values: filling numeric columns with the median and categorical with the mode
imdb_data_filled = imdb_data.fillna(imdb_data.median(numeric_only=True))
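for column in imdb_data_filled.select_dtypes(include=['object']):
    # Assign the result back rather than calling fillna(inplace=True) on a column selection
    imdb_data_filled[column] = imdb_data_filled[column].fillna(imdb_data_filled[column].mode()[0])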

# Since 'Genre' is a multi-label column, we'll simplify it by taking only the first listed genre for each movie
# This is to simplify the example to a single-label classification task
imdb_data_filled['Genre'] = imdb_data_filled['Genre'].apply(lambda x: x.split(',')[0])

# Encoding categorical variables
le = LabelEncoder()
for column in imdb_data_filled.select_dtypes(include=['object']):
    if column != 'Genre':  # Excluding the target variable 'Genre'
        imdb_data_filled[column] = le.fit_transform(imdb_data_filled[column])

# Feature Selection: Removing non-informative columns like 'Poster_Link', 'Series_Title', 'Overview', 'Star1', 'Star2', 'Star3', 'Star4'
features = imdb_data_filled.drop(columns=['Poster_Link', 'Series_Title', 'Overview', 'Star1', 'Star2', 'Star3', 'Star4', 'Genre'])
target = imdb_data_filled['Genre']

# Count the number of instances for each genre
genre_counts = imdb_data_filled['Genre'].value_counts()

# Filter out genres with fewer instances than a chosen threshold (here, fewer than 10)
threshold = 10
genres_to_keep = genre_counts[genre_counts >= threshold].index

# Filter the dataset to only include these genres
imdb_data_filtered = imdb_data_filled[imdb_data_filled['Genre'].isin(genres_to_keep)]

# Filtering the features to match the filtered genres
features_filtered = features.loc[imdb_data_filtered.index]

# Splitting the filtered data into training and testing sets
X_train_imdb, X_test_imdb, y_train_imdb, y_test_imdb = train_test_split(features_filtered, imdb_data_filtered['Genre'], test_size=0.2, random_state=42, stratify=imdb_data_filtered['Genre'])

X_train_imdb.shape, X_test_imdb.shape, y_train_imdb.shape, y_test_imdb.shape





((790, 8), (198, 8), (790,), (198,))
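
The preprocessing cell label-encodes every object column, and the preview above shows `Runtime` stored as text such as `142 min`. Label encoding turns those strings into category codes ordered alphabetically, not by duration. A possible alternative is to parse the number first. The sketch below uses a small stand-in frame built from the `Runtime` values in the preview; the column name `Runtime_min` is hypothetical.

```python
import pandas as pd

# Stand-in frame using the Runtime values shown in the preview above
demo = pd.DataFrame({"Runtime": ["142 min", "175 min", "152 min", "202 min", "96 min"]})

# Pull out the leading digits and convert them to numbers;
# values without digits would become NaN because of errors="coerce"
demo["Runtime_min"] = pd.to_numeric(
    demo["Runtime"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

print(demo["Runtime_min"].tolist())  # [142, 175, 152, 202, 96]
```

If other columns hold numbers as text, a similar conversion before encoding may help, though that depends on how the CSV stores them.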

# **Decision Tree Classifier**

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Initialize the Decision Tree Classifier
dt_classifier_imdb = DecisionTreeClassifier(random_state=42)

# Define scorers for cross-validation
accuracy_scorer = make_scorer(accuracy_score)
f1_scorer = make_scorer(f1_score, average='weighted')

# Perform 5-fold cross-validation for accuracy and F1-score
dt_accuracy_scores_imdb = cross_val_score(dt_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
dt_f1_scores_imdb = cross_val_score(dt_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
dt_avg_accuracy_imdb = np.mean(dt_accuracy_scores_imdb)
dt_avg_f1_score_imdb = np.mean(dt_f1_scores_imdb)
dt_avg_accuracy_imdb, dt_avg_f1_score_imdb


(0.2417721518987342, 0.2409621993382498)

# **Random Forest Classifier**

In [20]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier_imdb = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation
rf_accuracy_scores_imdb = cross_val_score(rf_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
rf_f1_scores_imdb = cross_val_score(rf_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
rf_avg_accuracy_imdb = np.mean(rf_accuracy_scores_imdb)
rf_avg_f1_score_imdb = np.mean(rf_f1_scores_imdb)
rf_avg_accuracy_imdb, rf_avg_f1_score_imdb

(0.34050632911392403, 0.3009111671922525)

# **Gaussian Naive Bayes Classifier**

In [22]:
from sklearn.naive_bayes import GaussianNB

# Initialize Gaussian Naive Bayes Classifier
nb_classifier_imdb = GaussianNB()

# Perform 5-fold cross-validation
nb_accuracy_scores_imdb = cross_val_score(nb_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
nb_f1_scores_imdb = cross_val_score(nb_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
nb_avg_accuracy_imdb = np.mean(nb_accuracy_scores_imdb)
nb_avg_f1_score_imdb = np.mean(nb_f1_scores_imdb)
nb_avg_accuracy_imdb, nb_avg_f1_score_imdb

(0.31645569620253167, 0.2601339592485509)

# **XGBoost Classifier**

In [26]:
from xgboost import XGBClassifier

# Encoding the target variable
le_genre = LabelEncoder()
y_train_imdb_encoded = le_genre.fit_transform(y_train_imdb)
y_test_imdb_encoded = le_genre.transform(y_test_imdb)

# Initialize XGBoost Classifier
xgb_classifier_imdb = XGBClassifier(random_state=42)

# Perform 5-fold cross-validation
xgb_accuracy_scores_imdb = cross_val_score(xgb_classifier_imdb, X_train_imdb, y_train_imdb_encoded, cv=5, scoring=accuracy_scorer)
xgb_f1_scores_imdb = cross_val_score(xgb_classifier_imdb, X_train_imdb, y_train_imdb_encoded, cv=5, scoring=f1_scorer)

# Calculate average scores
xgb_avg_accuracy_imdb = np.mean(xgb_accuracy_scores_imdb)
xgb_avg_f1_score_imdb = np.mean(xgb_f1_scores_imdb)
xgb_avg_accuracy_imdb, xgb_avg_f1_score_imdb


(0.3329113924050633, 0.31129578709843947)

# **CatBoost Classifier**

In [24]:
from catboost import CatBoostClassifier

# Initialize CatBoost Classifier
catboost_classifier_imdb = CatBoostClassifier(random_state=42, verbose=0)  # verbose=0 to avoid lengthy outputs

# Perform 5-fold cross-validation
catboost_accuracy_scores_imdb = cross_val_score(catboost_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
catboost_f1_scores_imdb = cross_val_score(catboost_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
catboost_avg_accuracy_imdb = np.mean(catboost_accuracy_scores_imdb)
catboost_avg_f1_score_imdb = np.mean(catboost_f1_scores_imdb)


In [25]:
catboost_avg_accuracy_imdb, catboost_avg_f1_score_imdb

(0.3329113924050633, 0.31511706348956564)

# **KNN Classifier**

In [27]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize KNN Classifier
knn_classifier_imdb = KNeighborsClassifier()

# Perform 5-fold cross-validation
knn_accuracy_scores_imdb = cross_val_score(knn_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
knn_f1_scores_imdb = cross_val_score(knn_classifier_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
knn_avg_accuracy_imdb = np.mean(knn_accuracy_scores_imdb)
knn_avg_f1_score_imdb = np.mean(knn_f1_scores_imdb)
knn_avg_accuracy_imdb, knn_avg_f1_score_imdb

(0.2037974683544304, 0.19092492593710284)

# **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression
log_reg_imdb = LogisticRegression(random_state=42, max_iter=1000)  # Increased max_iter for convergence

# Perform 5-fold cross-validation
log_reg_accuracy_scores_imdb = cross_val_score(log_reg_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=accuracy_scorer)
log_reg_f1_scores_imdb = cross_val_score(log_reg_imdb, X_train_imdb, y_train_imdb, cv=5, scoring=f1_scorer)

# Calculate average scores
log_reg_avg_accuracy_imdb = np.mean(log_reg_accuracy_scores_imdb)
log_reg_avg_f1_score_imdb = np.mean(log_reg_f1_scores_imdb)


In [29]:
log_reg_avg_accuracy_imdb, log_reg_avg_f1_score_imdb

(0.3075949367088608, 0.19885554473229744)
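
Accuracy here is noticeably higher than the weighted F1-score. One possible explanation, not verified in this notebook, is that the model predicts the most frequent genre for most rows. A majority-class baseline gives a reference point for that. The sketch below assumes imbalanced synthetic stand-in data (`X_demo`, `y_demo`), not the IMDb features.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic stand-in data (assumption): roughly 60% / 30% / 10% class shares
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)

# Baseline that always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")

# zero_division=0 scores never-predicted classes as 0 without raising warnings
weighted_f1 = make_scorer(f1_score, average="weighted", zero_division=0)

baseline_accuracy = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy").mean()
baseline_f1 = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring=weighted_f1).mean()
print(baseline_accuracy, baseline_f1)
```

A model whose scores sit close to this baseline on the real data would be adding little beyond the class frequencies.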

# **Spotify dataset for the year 2023**

The Spotify 2023 dataset has been successfully read. It contains the following columns:

1. `track_name`: Name of the track.
2. `artist(s)_name`: Name of the artist(s).
3. `artist_count`: Number of artists on the track.
4. `released_year`, `released_month`, `released_day`: Release date information.
5. `in_spotify_playlists`: Number of Spotify playlists the track is in.
6. `in_spotify_charts`: Number of Spotify charts the track is in.
7. `streams`: Number of streams.
8. `in_apple_playlists`: Number of Apple playlists the track is in.
9. Other music-related features like `bpm` (beats per minute), `key`, `mode`, percentages for `danceability`, `valence`, `energy`, `acousticness`, `instrumentalness`, `liveness`, `speechiness`.



In [32]:
spotify_data = pd.read_csv("spotify-2023.csv", encoding='ISO-8859-1')
spotify_data.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


In [33]:
# Data Preprocessing for the Spotify dataset

# Handling missing values: filling numeric columns with the median and categorical with the mode
spotify_data_filled = spotify_data.fillna(spotify_data.median(numeric_only=True))
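for column in spotify_data_filled.select_dtypes(include=['object']):
    # Assign the result back rather than calling fillna(inplace=True) on a column selection
    spotify_data_filled[column] = spotify_data_filled[column].fillna(spotify_data_filled[column].mode()[0])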

# Encoding categorical variables excluding the target variable 'mode'
le_spotify = LabelEncoder()
for column in spotify_data_filled.select_dtypes(include=['object']):
    if column != 'mode':  # Excluding the target variable 'mode'
        spotify_data_filled[column] = le_spotify.fit_transform(spotify_data_filled[column])

# Feature Selection: Removing columns that might not be informative for the model
features_spotify = spotify_data_filled.drop(columns=['track_name', 'artist(s)_name', 'mode'])
target_spotify = spotify_data_filled['mode']

# Splitting data into training and testing sets
X_train_spotify, X_test_spotify, y_train_spotify, y_test_spotify = train_test_split(features_spotify, target_spotify, test_size=0.2, random_state=42, stratify=target_spotify)

X_train_spotify.shape, X_test_spotify.shape, y_train_spotify.shape, y_test_spotify.shape



((762, 21), (191, 21), (762,), (191,))

- Training Features: 762 samples, 21 features
- Testing Features: 191 samples, 21 features
- Training Targets: 762 samples
- Testing Targets: 191 samples

# **Decision Tree Classifier**

In [34]:
dt_classifier_spotify = DecisionTreeClassifier(random_state=42)
dt_accuracy_scores_spotify = cross_val_score(dt_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
dt_f1_scores_spotify = cross_val_score(dt_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
dt_avg_accuracy_spotify = np.mean(dt_accuracy_scores_spotify)
dt_avg_f1_score_spotify = np.mean(dt_f1_scores_spotify)
dt_avg_accuracy_spotify, dt_avg_f1_score_spotify

(0.5445734434124527, 0.5433899061208894)

# **Random Forest Classifier**

In [36]:
rf_classifier_spotify = RandomForestClassifier(random_state=42)
rf_accuracy_scores_spotify = cross_val_score(rf_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
rf_f1_scores_spotify = cross_val_score(rf_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
rf_avg_accuracy_spotify = np.mean(rf_accuracy_scores_spotify)
rf_avg_f1_score_spotify = np.mean(rf_f1_scores_spotify)
rf_avg_accuracy_spotify, rf_avg_f1_score_spotify

(0.5958032335741315, 0.5740080890169604)

# **Gaussian Naive Bayes Classifier**

In [37]:
nb_classifier_spotify = GaussianNB()
nb_accuracy_scores_spotify = cross_val_score(nb_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
nb_f1_scores_spotify = cross_val_score(nb_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
nb_avg_accuracy_spotify = np.mean(nb_accuracy_scores_spotify)
nb_avg_f1_score_spotify = np.mean(nb_f1_scores_spotify)
nb_avg_accuracy_spotify, nb_avg_f1_score_spotify

(0.5604403164774683, 0.5604126080890504)

# **XGBoost Classifier**

In [40]:
# Encoding the target variable
le_mode = LabelEncoder()
y_train_spotify_encoded = le_mode.fit_transform(y_train_spotify)
y_test_spotify_encoded = le_mode.transform(y_test_spotify)

# Initialize XGBoost Classifier
xgb_classifier_spotify = XGBClassifier(random_state=42)

# Perform 5-fold cross-validation
xgb_accuracy_scores_spotify = cross_val_score(xgb_classifier_spotify, X_train_spotify, y_train_spotify_encoded, cv=5, scoring=accuracy_scorer)
xgb_f1_scores_spotify = cross_val_score(xgb_classifier_spotify, X_train_spotify, y_train_spotify_encoded, cv=5, scoring=f1_scorer)

# Calculate average scores
xgb_avg_accuracy_spotify = np.mean(xgb_accuracy_scores_spotify)
xgb_avg_f1_score_spotify = np.mean(xgb_f1_scores_spotify)
xgb_avg_accuracy_spotify, xgb_avg_f1_score_spotify

(0.6233316133470932, 0.6196829761197395)

# **CatBoost Classifier**

In [41]:
catboost_classifier_spotify = CatBoostClassifier(random_state=42, verbose=0)  # verbose=0 to avoid lengthy outputs
catboost_accuracy_scores_spotify = cross_val_score(catboost_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
catboost_f1_scores_spotify = cross_val_score(catboost_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
catboost_avg_accuracy_spotify = np.mean(catboost_accuracy_scores_spotify)
catboost_avg_f1_score_spotify = np.mean(catboost_f1_scores_spotify)
catboost_avg_accuracy_spotify, catboost_avg_f1_score_spotify

(0.5984434124527004, 0.5867118360698348)

# **KNN Classifier**

In [42]:
knn_classifier_spotify = KNeighborsClassifier()
knn_accuracy_scores_spotify = cross_val_score(knn_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
knn_f1_scores_spotify = cross_val_score(knn_classifier_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
knn_avg_accuracy_spotify = np.mean(knn_accuracy_scores_spotify)
knn_avg_f1_score_spotify = np.mean(knn_f1_scores_spotify)
knn_avg_accuracy_spotify, knn_avg_f1_score_spotify

(0.5327399380804954, 0.5282961919614629)

# **Logistic Regression**

In [None]:
log_reg_spotify = LogisticRegression(random_state=42, max_iter=1000)
log_reg_accuracy_scores_spotify = cross_val_score(log_reg_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=accuracy_scorer)
log_reg_f1_scores_spotify = cross_val_score(log_reg_spotify, X_train_spotify, y_train_spotify, cv=5, scoring=f1_scorer)
log_reg_avg_accuracy_spotify = np.mean(log_reg_accuracy_scores_spotify)
log_reg_avg_f1_score_spotify = np.mean(log_reg_f1_scores_spotify)

In [44]:
log_reg_avg_accuracy_spotify, log_reg_avg_f1_score_spotify

(0.5696078431372549, 0.5455150581669634)
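
Step 4 of the plan calls for a summary of each model on each dataset. One way to assemble it is sketched below. The scores are copied from the cell outputs above and rounded to four decimals; the football XGBoost cell has no recorded output, so it is left as NaN. The column names are hypothetical.

```python
import pandas as pd

nan = float("nan")

# (dataset, model, mean CV accuracy, mean CV weighted F1), copied from the outputs above
results = [
    ("Football", "Decision Tree", 0.6430, 0.6453),
    ("Football", "Random Forest", 0.7410, 0.7084),
    ("Football", "Naive Bayes", 0.4598, 0.5035),
    ("Football", "XGBoost", nan, nan),  # no output recorded for this cell
    ("Football", "CatBoost", 0.7521, 0.7259),
    ("Football", "KNN", 0.5492, 0.5130),
    ("Football", "Logistic Regression", 0.7363, 0.7140),
    ("IMDb", "Decision Tree", 0.2418, 0.2410),
    ("IMDb", "Random Forest", 0.3405, 0.3009),
    ("IMDb", "Naive Bayes", 0.3165, 0.2601),
    ("IMDb", "XGBoost", 0.3329, 0.3113),
    ("IMDb", "CatBoost", 0.3329, 0.3151),
    ("IMDb", "KNN", 0.2038, 0.1909),
    ("IMDb", "Logistic Regression", 0.3076, 0.1989),
    ("Spotify", "Decision Tree", 0.5446, 0.5434),
    ("Spotify", "Random Forest", 0.5958, 0.5740),
    ("Spotify", "Naive Bayes", 0.5604, 0.5604),
    ("Spotify", "XGBoost", 0.6233, 0.6197),
    ("Spotify", "CatBoost", 0.5984, 0.5867),
    ("Spotify", "KNN", 0.5327, 0.5283),
    ("Spotify", "Logistic Regression", 0.5696, 0.5455),
]

summary = pd.DataFrame(results, columns=["dataset", "model", "cv_accuracy", "cv_f1_weighted"])

# One row per model, one column per dataset
accuracy_table = summary.pivot(index="model", columns="dataset", values="cv_accuracy")
f1_table = summary.pivot(index="model", columns="dataset", values="cv_f1_weighted")

print(accuracy_table)
print(f1_table)
```

These are cross-validation averages on the training splits; the held-out test sets created earlier are not scored in this notebook.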