# Analysis of 40+ Years of Video Game Data

- This Analysis involves NLP Techniques, Statistical modeling of Video Game Trends, Coorelation Matrices, Sentiment Analysis, various visualizations, Time Series Data, and lastly utilizing Machine Learning Models to create a genre classification model.

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.sentiment import SentimentIntensityAnalyzer
from datetime import datetime
import plotly.graph_objects as go

In [69]:
# Dataset of Popular Video Games (1980 - 2023) 🎮 

data = pd.read_csv('/Users/adishsundar/Desktop/FinalPortfolio/games.csv')

In [9]:
# There are 1512 Games in this dataset. The highest rating is a 4.8, with an average rating of 3.7

# Interestingly enough, the Skew is -1, meaning that there is a high concentration of data on the right side of 
# ratings, showcasing that most ratings are higher rather than lower.

print(len(data["Rating"]))

data.agg(
    {
        "Rating": ["min", "max", "median", "skew", "mean"]
    }
)

1512


Unnamed: 0,Rating
min,0.7
max,4.8
median,3.8
skew,-1.005106
mean,3.719346


In [10]:
data.sample(3)

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
1391,1391,100% Orange Juice,"Aug 15, 2009","['Orange_Juice', 'Fruitbat Factory']",3.4,112,112,"['Card & Board Game', 'Indie', 'Strategy', 'Tu...",100% Orange Juice is a goal-oriented boardgame...,"[""Played with a few friends, had a mostly awfu...",1.8K,51,292,89
972,972,The Legend of Zelda: Tears of the Kingdom,"May 12, 2023","['Nintendo', 'Nintendo EPD Production Group No...",,581,581,"['Adventure', 'RPG']",The Legend of Zelda: Tears of the Kingdom is t...,[],72,6,1.6K,5.4K
904,904,Dead Space,"Jan 27, 2023","['Motive Studios', 'Electronic Arts']",4.3,501,501,"['Adventure', 'RPG', 'Shooter']",The sci-fi survival horror classic Dead Space ...,['Pretty damn good\n \n ...,1.3K,248,985,1.9K


In [70]:
# Unnecessary Column

del data['Unnamed: 0']

In [12]:
# Look at highest rated games to explore what makes a game highly rated
# Adventure seems to be a genre that is within all of the top 10 highest rated games

data.sort_values(by='Rating', ascending = False)[0:10]

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
1252,Elden Ring: Shadow of the Erdtree,releases on TBD,"['FromSoftware', 'Bandai Namco Entertainment']",4.8,18,18,"['Adventure', 'RPG']",An expansion to Elden Ring setting players on ...,['I really loved that they integrated Family G...,1,0,39,146
28,Disco Elysium: The Final Cut,"May 01, 2020",['ZA/UM'],4.6,1.1K,1.1K,"['Adventure', 'Indie', 'RPG']",Disco Elysium: The Final Cut is a groundbreaki...,"[""a captivating journey from start to finish. ...",6K,1.2K,5K,2.7K
1035,Bloodborne: The Old Hunters,"Nov 24, 2015","['FromSoftware', 'Sony Computer Entertainment']",4.6,266,266,"['Adventure', 'RPG']",The Old Hunters is the first Expansion for Blo...,['HAHA FINALMENTE MATEI O LAWRENCE E O ÓRFÃO D...,4.4K,68,930,616
539,Umineko: When They Cry Chiru,"Sep 15, 2009",['07th Expansion'],4.6,324,324,"['Adventure', 'Visual Novel']",Umineko no Naku Koro ni Chiru is the second ha...,"['cried like a little bitch ngl', ""God I reall...",1.7K,108,582,493
369,Outer Wilds,"May 28, 2019","['Mobius Digital', 'Annapurna Interactive']",4.6,1.8K,1.8K,"['Adventure', 'Indie', 'Puzzle', 'Simulator']",Outer Wilds is a critically-acclaimed and awar...,"['Replayed with my girlfriend, still the best ...",7.7K,661,4.8K,3.1K
354,Disco Elysium: The Final Cut,"May 01, 2020",['ZA/UM'],4.6,1.1K,1.1K,"['Adventure', 'Indie', 'RPG']",Disco Elysium: The Final Cut is a groundbreaki...,"[""a captivating journey from start to finish. ...",6K,1.2K,5K,2.7K
43,Outer Wilds,"May 28, 2019","['Mobius Digital', 'Annapurna Interactive']",4.6,1.8K,1.8K,"['Adventure', 'Indie', 'Puzzle', 'Simulator']",Outer Wilds is a critically-acclaimed and awar...,"['Replayed with my girlfriend, still the best ...",7.7K,661,4.8K,3.1K
717,Hitman World of Assassination,"Jan 26, 2023","['Inlusio Interactive', 'IO Interactive']",4.6,38,38,"['Adventure', 'Shooter', 'Tactical']",Become Agent 47 in the ultimate spy-thriller a...,"['Aunque ya había jugado a los tres Hitman, va...",167,47,54,54
297,Bloodborne: The Old Hunters,"Nov 24, 2015","['FromSoftware', 'Sony Computer Entertainment']",4.6,266,266,"['Adventure', 'RPG']",The Old Hunters is the first Expansion for Blo...,['HAHA FINALMENTE MATEI O LAWRENCE E O ÓRFÃO D...,4.4K,68,930,616
804,Disco Elysium: The Final Cut,"May 01, 2020",['ZA/UM'],4.6,1.1K,1.1K,"['Adventure', 'Indie', 'RPG']",Disco Elysium: The Final Cut is a groundbreaki...,"[""a captivating journey from start to finish. ...",6K,1.2K,5K,2.7K


In [9]:
# This visualization shows out of the games rated above a 4.0, what percentage of games have 
# Adventure as one of the genres.

filtered_df = data[(data["Rating"] > 4.0) & (data["Genres"].str.contains("Adventure"))]
percentage = (len(filtered_df) / len(data[data["Rating"] > 4.0])) * 100

# Create a Plotly figure with a dark background
fig = go.Figure()
fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',
    paper_bgcolor='rgb(17, 17, 17)',
    font_color='white'
)

# Add the bar plot
fig.add_trace(go.Bar(
    x=["With Adventure Genre", "Without Adventure Genre"],
    y=[percentage, 100 - percentage],
    marker=dict(color=["skyblue", "lightgray"]),
    text=[f"{percentage:.1f}%", f"{100 - percentage:.1f}%"],
    textposition="outside"
))

# Set axis labels and title
fig.update_xaxes(title_text="Games", tickfont=dict(color="white"))
fig.update_yaxes(title_text="Percentage", tickfont=dict(color="white"))
fig.update_layout(title="Popular Video Games Rated Above 4.0", title_font_color="white")

# Adjust the layout
fig.update_layout(barmode='stack')
fig.update_traces(texttemplate='%{text}', textfont=dict(color='black'))

# Display the plot
fig.show()

In [14]:
# We can create a coorelation matrix after converting the categorical genres into binary variables using
# one-hot encoding. 

# I noticed that some variables were being counted twice due to brackets, so I used regex to work around this. 

# The results support the findings above, the RPG and Adventure genres seem to have the highest 
# positive coorelatons to Ratings. 

# I recreated the above bar graph with RPG, but the percentage was much lower,
# this leads to me to believe that Adventure is clearly the genre that has the strongest positive impact on the
# rating of a game.

# Remove brackets and extra whitespace from genres
data["Genres"] = data["Genres"].str.replace(r"\[|\]", "").str.strip()

# Convert genres into binary columns using one-hot encoding
genres = data["Genres"].str.get_dummies(sep=", ")

# Concatenate the one-hot encoded genres with the original DataFrame
df_encoded = pd.concat([data, genres], axis=1)

# Calculate the correlation matrix
correlation_matrix = df_encoded.corr()

# Extract the correlation values for the "Rating" column
rating_correlation = correlation_matrix["Rating"]

# Sort the correlation values in descending order
sorted_correlation = rating_correlation.drop("Rating").sort_values(ascending=False)

# Print the top correlations between genres and ratings
sorted_correlation

  data["Genres"] = data["Genres"].str.replace(r"\[|\]", "").str.strip()


'RPG'                    0.154056
'Adventure'              0.152945
'Visual Novel'           0.118716
'Turn Based Strategy'    0.083669
'Puzzle'                 0.058194
'Brawler'                0.034218
'Tactical'               0.012788
'Real Time Strategy'    -0.009134
'Pinball'               -0.010644
'Indie'                 -0.014648
'Card & Board Game'     -0.019625
'Platform'              -0.021084
'Point-and-Click'       -0.023901
'Simulator'             -0.030900
'Music'                 -0.034084
'Quiz/Trivia'           -0.043169
'Sport'                 -0.050717
'Shooter'               -0.051751
'Racing'                -0.055520
'Arcade'                -0.060311
'Strategy'              -0.087058
'MOBA'                  -0.119377
'Fighting'              -0.123594
Name: Rating, dtype: float64

In [15]:
# I also wanted to create a coorelation matrix that showed the top teams coorelated with the highest ratings.
# to see if certian teams made up a large percentage of the highest rated games.
# I used the same process as above.

# Remove brackets and extra whitespace from genres
data["Team"] = data["Team"].str.replace(r"\[|\]", "").str.strip()

# Convert genres into binary columns using one-hot encoding
teams = data["Team"].str.get_dummies(sep=", ")

# Concatenate the one-hot encoded genres with the original DataFrame
df_encoded = pd.concat([data, teams], axis=1)

# Calculate the correlation matrix
correlation_matrix = df_encoded.corr()

# Extract the correlation values for the "Rating" column
rating_correlation = correlation_matrix["Rating"]

# Sort the correlation values in descending order
sorted_correlation = rating_correlation.drop("Rating").sort_values(ascending=False)

# Print the top 3 correlations between teams and ratings
sorted_correlation[0:3]

  data["Team"] = data["Team"].str.replace(r"\[|\]", "").str.strip()


'FromSoftware'                      0.115589
'ZA/UM'                             0.104855
'Sony Interactive Entertainment'    0.093589
Name: Rating, dtype: float64

In [16]:
# It appears that certain teams do not make up a large percentage of the highest rated games. Sony and FromSoftware 
# together only account for roughly 9.4% of all of the games rated above a 4.0.
# Genre seems to be a better indicator of game ratings.

filtered_df = data[(data["Rating"] > 4.0) & ((data["Team"].str.contains("FromSoftware")) | (data["Team"].str.contains("Sony Interactive Entertainment")))]
percentage = (len(filtered_df) / len(data[data["Rating"] > 4.0])) * 100
percentage

9.409190371991247

In [21]:
# I want to perform a sentiment analysis on the reviews of these games using NLP techniques to determine what 
# games seem to have positive sentiments vs negative sentiments to look for overall trends in the data.

sia = SentimentIntensityAnalyzer()

# Create a copy of the dataset
df = data.copy()

# Create a smaller verson of the data set with relevant columns
df = df[['Title', 'Release Date', 'Team', 'Rating', 'Genres', 'Reviews']]

df['Sentiment_Score'] = df['Reviews'].apply(lambda x: sia.polarity_scores(x)['compound'])

df.sort_values(by = 'Sentiment_Score', ascending = False)

Unnamed: 0,Title,Release Date,Team,Rating,Genres,Reviews,Sentiment_Score
992,Super Mario World 2: Yoshi's Island,"Aug 05, 1995","['Nintendo EAD', 'Nintendo']",4.1,['Platform'],"['Dinossauro', ""Beautiful hand-drawn-esque spr...",0.9997
610,Marvel Snap,"Oct 18, 2022","['Second Dinner', 'nuverse']",3.7,"['Card & Board Game', 'Strategy']","[""Starts out surprisingly fun and addictive at...",0.9997
1473,Shantae and the Seven Sirens,"Mar 27, 2020","['WayForward Technologies', 'WayForward']",3.5,"['Adventure', 'Platform']","['And with this entry, I think my interest in ...",0.9994
1406,Shin Megami Tensei: Devil Survivor,"Jan 15, 2009","['Atlus', 'Ghostlight Ltd.']",3.9,"['RPG', 'Strategy', 'Tactical', 'Visual Novel']","[""Oh fuck I forgot about this game. Played it ...",0.9993
674,Lunistice,"Nov 10, 2022","['A Grumpy Fox', 'Deck13 Interactive']",3.6,"['Adventure', 'Indie', 'Platform']","['Has some hiccups, but honestly super fun and...",0.9992
...,...,...,...,...,...,...,...
867,Dark Souls II: Scholar of the First Sin,"Apr 02, 2015","['Bandai Namco Entertainment', 'FromSoftware']",3.5,"['Adventure', 'RPG']","['Kinosoge', ""I hate this star-wars prequel es...",-0.9907
1243,Xenoblade Chronicles: Future Connected,"May 29, 2020","['Nintendo', 'Monolith Soft']",3.5,['RPG'],['I forget about this game until I randomly re...,-0.9933
530,The Evil Within,"Oct 14, 2014","['Tango Gameworks', 'Bethesda Softworks']",3.4,"['Adventure', 'Shooter']",['Tries to capture the same magic that RE4 has...,-0.9941
1141,Resident Evil,"Mar 22, 1996","['Capcom', 'Capcom Planning Room 2']",3.7,"['Adventure', 'Puzzle']","[""(Played before 2023)\n \...",-0.9945


In [18]:
# The review that was has the lowest sentiment score, seems to be quite accurate given the negative language
# used to describe the game

df.loc[1141]['Reviews']

'["(Played before 2023)\\n                     \\n                     This game did not age well. It is a product of its time and a lot of bad parts about it are seen as charming like the voice acting, but the voice acting is still, well, bad. Like a cheesey horror film, the story is dumb and stupid, and there is a lot of goofy moments (You were almost a Jill sandwich!). The design of the game is really good though, dealing with zombies in terms of trying to avoid them and risk getting bit, or clearing the threat with ammo and hoping you don\'t need it later. But overall there isn\'t anything special going on with it outside of the historical importance, so if you\'re not a fan of the series, especially the remake, just don\'t play it.", "REmake 4 prep still. gonna be cutting it really close with trying to fit all of this shit in but i\'m determined.", \'Resident Evil 1 is certainly a product of its time. It controls like molasses and its tank controls don’t help much. Since I played 

In [19]:
# The review that was has the highest sentiment score, seems to be quite accurate given the positive language
# used to describe the game

df.loc[992]['Reviews']

'[\'Dinossauro\', "Beautiful hand-drawn-esque sprites, the most jovial and joyful music you\'ve ever heard in a game, and engaging levels with unique mechanics. This game is great. It can skew a bit on the easy end but that\'s not a bad thing, plus collecting the extras and going for 100 Points is a nice challenge. This game holds a special place to me and I have deep nostalgia for it, but it\'s also just a solidly designed platformer.", "Super Mario Land 2: Yoshi\'s Island is one of the best platformers Nintendo has ever made. The game controls great in a way that even the likes of its predecessor, Super Mario World, does not. Every part of Yoshi’s moveset just feels natural. The flutter jump is great to adjust your jumps, the egg shooting is super responsive, and the ground pound is so iconic that even Mario stole it for later entries in his series. The level design compliments these controls well, as the exploratory nature of many of its levels put these skills to perfect use. To be

In [20]:
# Looking at the coorelation between Sentiment Scores to Genres and Ratings.
# There is a decently positive coorelation between Sentiment Scores and Ratings, as well as some of the top genres. 
# This makes sense, as the higher a sentiment score, the higher the rating of a game sould be.

# We would expect to see similar genres from before showing up again since they make up a large
# percentage of the top rated games. Adventure and RPG are towards the top again. However, the coorelations to the 
# genres are quite low, so this may not be the most useful metric to look at.

# Remove brackets and extra whitespace from genres
df["Genres"] = df["Genres"].str.replace(r"\[|\]", "").str.strip()

# Convert genres into binary columns using one-hot encoding
genres = df["Genres"].str.get_dummies(sep=", ")

# Concatenate the one-hot encoded genres with the original DataFrame
df_encoded = pd.concat([df, genres], axis=1)

# Calculate the correlation matrix
correlation_matrix = df_encoded.corr()

# Extract the correlation values for the "Sentiment" column
sentiment_correlation = correlation_matrix["Sentiment_Score"]

# Sort the correlation values in descending order
sorted_correlation = sentiment_correlation.drop("Sentiment_Score").sort_values(ascending=False)

# Print the top correlations between genres and ratings
sorted_correlation.head()

  df["Genres"] = df["Genres"].str.replace(r"\[|\]", "").str.strip()


Rating                   0.168689
'RPG'                    0.106693
'Platform'               0.079848
'Turn Based Strategy'    0.048207
'Adventure'              0.044182
Name: Sentiment_Score, dtype: float64

In [8]:
# Creating a time series analysis to further look at the growth of the Adventure genre over time.
# The results seem to support the growing popularity of adventure games.

# Convert the 'Release Date' column to datetime format
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')

# Filter the DataFrame to include only games with the 'Adventure' genre
adventure_games = df[df['Genres'].str.contains('Adventure')]

# Filter the adventure games up to the current year
adventure_games = adventure_games[adventure_games['Release Date'].dt.year <= 2022]

# Group the data by year and count the number of games
adventure_counts = adventure_games.groupby(adventure_games['Release Date'].dt.year).size().reset_index()
adventure_counts.columns = ['Year', 'Count']

# Create a Plotly figure with a dark background
fig = go.Figure()
fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',
    paper_bgcolor='rgb(17, 17, 17)',
    font_color='white'
)

# Add the time series line plot
fig.add_trace(go.Scatter(
    x=adventure_counts['Year'],
    y=adventure_counts['Count'],
    mode='lines+markers',
    marker=dict(color='rgb(148, 0, 211)'),  # Purple marker color
    line=dict(color='rgb(148, 0, 211)'),  # Purple line color
    name='# Adventure Games'
))

# Set axis labels and title
fig.update_xaxes(title_text='Year', tickfont=dict(color='white'))
fig.update_yaxes(title_text='Count', tickfont=dict(color='white'))
fig.update_layout(title='# Adventure Games over Time', title_font_color='white')

# Display the plot
fig.show()

In [7]:
# The Time Seires graph below shows the average ratings of games over time.
# I wanted to look at how this metric fluctuated over time, as well as find the year with the highest average ratings.

# Convert the "Release Date" column to datetime format
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')

# Extract the year from the "Release Date" column
df['Year'] = df['Release Date'].dt.year

# Group the data by year and calculate the average rating
average_ratings = df.groupby('Year')['Rating'].mean().reset_index()

# Create a Plotly figure with a dark background
fig = go.Figure()
fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',
    paper_bgcolor='rgb(17, 17, 17)',
    font_color='white'
)

# Add the time series line plot with a purple line
fig.add_trace(go.Scatter(
    x=average_ratings['Year'],
    y=average_ratings['Rating'],
    mode='lines+markers',
    marker=dict(color='rgb(148, 0, 211)'),  # Purple marker color
    line=dict(color='rgb(148, 0, 211)'),  # Purple line color
    name='Average Rating'
))

# Set axis labels and title
fig.update_xaxes(title_text='Year', tickfont=dict(color='white'))
fig.update_yaxes(title_text='Average Rating', tickfont=dict(color='white'))
fig.update_layout(title='Average Ratings of Games Over Time', title_font_color='white')

# Display the plot
fig.show()

In [23]:
# Convert the "Release Date" column to datetime format
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')

# Extract the year from the "Release Date" column
df['Year'] = df['Release Date'].dt.year

# Group the data by year and calculate the average rating
average_ratings = df.groupby('Year')['Rating'].mean().reset_index()

# Find the year with the highest average rating
max_average_rating_year = average_ratings.loc[average_ratings['Rating'].idxmax(), 'Year']
max_average_rating = average_ratings['Rating'].max()

print("Year with the highest average rating:", round(max_average_rating_year))
print("Highest average rating:", round(max_average_rating,2))

Year with the highest average rating: 1998
Highest average rating: 4.11


In [71]:
# Now lets create a genre classification model using the variables that we have to predict the rating a game will 
# recieve. First, we need to make sure that our data is the correct type. Clearly, the numeric columns are not
# numeric, so let's change that.

data.dtypes[0:12]

Title                 object
Release Date          object
Team                  object
Rating               float64
Times Listed          object
Number of Reviews     object
Genres                object
Summary               object
Reviews               object
Plays                 object
Playing               object
Backlogs              object
dtype: object

In [73]:
# Select the columns to convert and remove the 'K'
columns_to_convert = ['Times Listed', 'Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist']
data[columns_to_convert] = data[columns_to_convert].replace('K', '', regex=True)

# Multiply the values by 1000 and convert to float
data[columns_to_convert] = data[columns_to_convert].astype(float) * 1000

# Verify the updated DataFrame
data.head()

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3900.0,3900.0,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17000.0,3800.0,4600.0,4800.0
1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2900.0,2900.0,"['Adventure', 'Brawler', 'Indie', 'RPG']",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21000.0,3200.0,6300.0,3600.0
2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4300.0,4300.0,"['Adventure', 'RPG']",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30000.0,2500.0,5000.0,2600.0
3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3500.0,3500.0,"['Adventure', 'Indie', 'RPG', 'Turn Based Stra...","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28000.0,679000.0,4900.0,1800.0
4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3000.0,3000.0,"['Adventure', 'Indie', 'Platform']",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21000.0,2400.0,8300.0,2300.0


In [84]:
# Creating various machine learning models to test the accuracies of our models in different ways.

# Comparing Random Forest, Gradient Boosting, SVC, and Logstic Regression.

# As can be seen below, the RandomForestClassifier Accuracy is the best at .46. This indicates that 
# the best model predicted the correct genre for 46% of the instances in the test set. Lets see if we can make this 
# accuracy better.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

# Preprocess the data by splitting it into features (ratings and sentiment scores) and the target variable (genres)
X = data[['Times Listed', 'Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Rating']]
y = data['Genres']

# Replace missing values with the mean of the column
imputer = SimpleImputer(strategy='mean')
X[['Times Listed', 'Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Rating']] = imputer.fit_transform(X[['Times Listed', 'Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Rating']])

# Encode the target variable (genres) using LabelEncoder:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.1, random_state=42)

# Create and train different classifiers
classifiers = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(random_state=42),
    SVC(kernel='rbf', random_state=42),
    LogisticRegression(random_state=42)
]

for classifier in classifiers:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(type(classifier).__name__, "Accuracy:", round(accuracy, 2))



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



RandomForestClassifier Accuracy: 0.46
GradientBoostingClassifier Accuracy: 0.43
SVC Accuracy: 0.12
LogisticRegression Accuracy: 0.07



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



In [86]:
# We improved the model accuracy here using Baggining ensemble techniques, lets see if Boosting is better or worse

from sklearn.ensemble import BaggingClassifier

# Create an ensemble of Random Forest classifiers using bagging
bagging_classifier = BaggingClassifier(RandomForestClassifier(n_estimators=100, random_state=42), 
                                      n_estimators=10, random_state=42)

# Train the bagging classifier
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Random Forest Classifier Accuracy:", round(accuracy, 2))

Bagging Random Forest Classifier Accuracy: 0.47


In [88]:
# This model accuracy is worse, so we can conclude that using Bagging with a Random Forest Clasifier
# is the best model.

from sklearn.ensemble import GradientBoostingClassifier

# Create a Gradient Boosting classifier
boosting_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the boosting classifier
boosting_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = boosting_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Gradient Boosting Random Forest Classifier Accuracy:", round(accuracy, 2))

Gradient Boosting Random Forest Classifier Accuracy: 0.43


In [None]:
# Lastly, I wanted to use ML and NLP through tensorflow and keras to train a model based off of the Reviews in
# the dataset to generate new game reviews, however my macbook did not have nearly enough processing capacity to get
# anything close to a decent model, as my kernal for this notebook would die anytime I tried to create a semi
# sophisticated model.

# Overall I'm really happy with eveything was able to look at throughout this project. Even what didn't go as planned
# still taught me a lot and this is definitely a project I'll remember for a long time. 