<center>
    <h1>🎮 Video Game Sales Analysis EDA, Visualizations</h1>
</center>
<!--  and Sales Prediction Using Machine Learning Models -->

## Import Necessary Packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Visualization Packages Importing
from matplotlib import pyplot as plt
import seaborn as sns
from plotly import graph_objects as go
from plotly import express as px
# import plotly.plotly as py
from plotly.offline import init_notebook_mode,iplot

# WordCloud Packages
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

<img align="center" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRjU3MyvLT2oT-7mMElUQzxHfj8q7y2iompRg&usqp=CAU" alt="Video Game Sales" width="100%"/>

## Data Collection & Loading

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read Data Using read_csv() method
df = pd.read_csv('/kaggle/input/videogamesales/vgsales.csv')

Check a few random records with the `.sample()` method. It returns the requested number of randomly chosen rows (five here).

In [None]:
df.sample(5)
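
Because `.sample()` draws different rows on every run, the output changes between executions. A minimal sketch of making the draw repeatable, assuming a small made-up frame named `toy` instead of the real dataset:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Global_Sales": list(range(8)),
})

# n sets how many rows are drawn; random_state makes the draw repeatable.
first = toy.sample(n=3, random_state=0)
second = toy.sample(n=3, random_state=0)

print(first)
print(first.equals(second))
```

Passing the same `random_state` to `df.sample(5, random_state=...)` should give the same five rows each time the notebook is rerun.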

Display the first few records. Let's take a moment to look over the data and its feature names.

In [None]:
df.head()

In the same way, we can look at the last few records to get a better feel for the dataset.

In [None]:
df.tail()

Here we can see the shape of the dataset. The `.shape` attribute returns a tuple of (number of rows, number of columns).

In [None]:
df.shape

Let's print the feature names with the `.columns` attribute. It returns a pandas `Index` whose `dtype` is `object`.

In [None]:
df.columns

## Data Exploration and Analysis

<img align="center" src="https://i.chzbgr.com/full/8226604032/hD903110C/steam-sales-have-me-on-the-run" alt="Sales Statistics"
     width="100%" />

**Statistical information for all numerical features**

In [None]:
df.describe()

**Overall information about the dataset**

The `.info()` method prints a concise summary of the dataset, including:
1. Total number of records,
2. Feature names,
3. Number of non-null values per feature (from which the number of missing values follows),
4. The datatype of every feature, and
5. Its memory usage

In [None]:
df.info()

* Now we need to know which features have missing values, so we look for the features that contain `NaN`.

In [None]:
df.isna().any()

Only two features have missing values: `Year` and `Publisher`.

Next, we need to know how many values are missing, so we compute the percentage of missing values for each feature.

In [None]:
(df.isna().sum() * 100) / df.shape[0]
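
The same percentage can be obtained with `.isna().mean()`, since the mean of a boolean column is the fraction of `True` values. A small sketch on a made-up frame (`toy` and its values are assumptions for illustration, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps, only to illustrate the calculation.
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["Nintendo", "Nintendo", None, None],
    "Global_Sales": [82.7, 40.2, 35.8, 33.0],
})

# Fraction of missing values per column, expressed as a percentage.
pct_missing = toy.isna().mean() * 100

# Equivalent to the count-based formula used in the cell above.
pct_from_counts = toy.isna().sum() * 100 / toy.shape[0]

# Keep only the columns that actually have gaps, largest share first.
missing_only = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(missing_only)
```

Filtering to the non-zero entries keeps the output short when a dataset has many complete columns.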

We can clearly see that about `1.63%` of the `Year` values and about `0.35%` of the `Publisher` values are missing.

### Separate the Numerical And Categorical Features


**Categorical features are:**

In [None]:
# Creating Categorical DataFrame
categorical_df = df.select_dtypes('O')

categorical_df.head()

**Numerical features are:**

In [None]:
# Create Numerical DataFrame (all integer and float columns, whatever their bit width)

numerical_df = df.select_dtypes(include='number')

numerical_df.head()
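
As a hedged aside, the split can also be written so that the two groups are guaranteed to be complementary: `include='number'` for the numerical part and `exclude='number'` for everything else. This avoids depending on text columns having exactly the `object` dtype, which may not hold in every pandas version. The frame `toy` below is made up for illustration:

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Year": [2006.0, 1989.0],
    "Rank": [1, 2],
})

# Numeric columns (ints and floats of any width).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```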

Create lists of the categorical and numerical feature names from `categorical_df` and `numerical_df`.

In [None]:
categorical_features = categorical_df.columns
print(categorical_features)

print('-' * 60)

numerical_features = numerical_df.columns
print(numerical_features)

#### Analysis of Categorical Features: check the most frequent values.

In [None]:
# Show the five most frequent values of every categorical feature

for category_name in categorical_features:
    print('-' * 50)
    print("Column Name: ", category_name)
    print(' ' * 50)
    
    print(df[category_name].value_counts().head())
    
    print('-' * 50)
    print('-' * 50)

### Data Cleaning and Removing NaN Values

First, we check how many values are missing in each feature.

In [None]:
# Count the missing values per feature

df.isna().sum()

* Here, one feature is numerical and the other is categorical, so we use `include='all'` to describe both.

In [None]:
df[['Year', 'Publisher']].describe(include='all')

#### Fill Missing Values in the Year Feature

In [None]:
df.Year = df.Year.fillna(df.Year.mean())

In [None]:
# Change Year dtype to int32

df.Year = df.Year.astype('int32')
df.Year
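
One detail worth knowing about the two cells above: the mean of the years is usually not a whole number, and `.astype('int32')` drops the fractional part rather than rounding. A minimal sketch with made-up years (the series `years` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up years with one gap; the mean of the known values is 2001.67.
years = pd.Series([2000.0, 2001.0, np.nan, 2004.0])

mean_filled = years.fillna(years.mean())

# Casting to int truncates: 2001.67 becomes 2001.
truncated = mean_filled.astype("int32")

# Rounding first gives the nearest whole year instead: 2002.
rounded = mean_filled.round().astype("int32")

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation or rounding is preferable depends on the analysis; filling with the median is another common choice for year-like columns.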

#### Fill Missing Values in the Publisher Feature

`Publisher` is a categorical feature, so we fill its missing values with the most frequent value.

We use the `.value_counts()` method to see which value occurs most often.

In [None]:
df.Publisher.value_counts(normalize=True)

- To fill the gaps we use `.mode()`, which returns the most frequent publisher, 'Electronic Arts'.

In [None]:
df.Publisher = df.Publisher.fillna(df.Publisher.mode()[0])
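
To make the step above concrete: `.mode()` returns a Series (there can be ties), so `[0]` picks its first entry. The sketch below uses a made-up series named `publishers` and also shows, as an alternative that is not used in this notebook, filling with an explicit placeholder label:

```python
import pandas as pd

# Made-up publisher column with two gaps.
publishers = pd.Series(
    ["Electronic Arts", "Nintendo", None, "Electronic Arts", None]
)

# mode() ignores missing values and returns the most frequent entries.
most_common = publishers.mode()[0]

# Option 1 (what the notebook does): fill with the most frequent value.
filled_mode = publishers.fillna(most_common)

# Option 2 (alternative): keep the gaps visible under a placeholder label.
filled_unknown = publishers.fillna("Unknown")

print(most_common)
print(filled_mode.value_counts())
```

Filling with the mode inflates the count of the most frequent publisher slightly, which is usually acceptable when only a small share of values is missing.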

Now, Let's check the datatype for both features.

In [None]:
df[['Publisher','Year']].dtypes

## Data Visualization

<img align="center" src="https://media.giphy.com/media/1XeAoRH74h7i0MtwCU/giphy.gif" alt="Video Games Sales Visualization" width="100%" />

#### *Top 10 publishers by number of published games (bar plot)*

In [None]:
# Get Top 10 Video Games Publishers
top_10_publishers = df.Publisher.value_counts().head(10)

# Newer pandas versions name the index after the column, so both keys are given
px.bar(top_10_publishers, title='Top 10 Video Game Publishers', 
       labels={
           'value': "Number of Games Published",
           'index': "Name of the Publisher",
           'Publisher': "Name of the Publisher"
       })

#### *Top 10 video game genres by number of games (bar and scatter plots)*

In [None]:
# Get Top 10 Video Games Genre
top_10_generes = df.Genre.value_counts().head(10)
# top_10_generes

# Newer pandas versions name the index after the column, so both keys are given
fig = px.bar(top_10_generes, title='Top 10 Video Game Genres', 
       labels={
           'value': "Number of Games",
           'index': "Name of the Genre",
           'Genre': "Name of the Genre"
       })

fig.show()


fig = px.scatter(top_10_generes, title='Top Genres by Number of Games',
              labels={
                   'value': "Number of Games",
                   'index': "Genre"
               })
fig.show()



# px.bar(top_10_generes.index, top_10_generes.values, title='Top 10 Video Game Genres', 
#        labels={
#            'value': "Numbers",
#            'index': "Genre"
#        })

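#### *Top 10 platforms by number of games (line plot)*

In [None]:
# Get Top 10 Video Games Platforms, sorted from fewest to most games
top_10_platform = df.Platform.value_counts().head(10).sort_values()
top_10_platform

# Newer pandas versions name the index after the column, so both keys are given
fig = px.line(top_10_platform, title='Top 10 Platforms by Number of Games',
              labels={
                   'value': "Counts",
                   'index': "Name of the Platform",
                   'Platform': "Name of the Platform"
               })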

# fig = go.Figure(data=go.Scatter(x= top_10_platform.index, y=top_10_platform.values,
#                                title="Top Playing Platforms"))

fig.show()

In [None]:
df.head()

<img align="center" src="https://media.giphy.com/media/idvY7ibAEvN9bh2rlV/giphy.gif" alt="Video Games Sales Visualization" width="100%"/>

#### *Total sales (in millions) per year for North America, Europe, Japan and other regions*

In [None]:
# Sum only the numeric sales columns per year ('Name' is text and must not be aggregated)
year_wise_sales = df[['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].groupby(by='Year').sum()


fig1 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['NA_Sales'],
                  name = "North America's Sales",
                  line_shape='vh'
                 )

fig2 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['EU_Sales'],
                  name = "Europe's Sales",
                  line_shape='vh')

fig3 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['JP_Sales'],
                  name = "Japan's Sales",
                  line_shape='vh')

fig4 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['Other_Sales'],
                  name = "Other Sales",
                  line_shape='vh')

figs = [ fig1, fig2, fig3, fig4 ]

layout = dict(title = 'Year Wise Total Game Sales of North America, Europe, Japan and Other Regions',
              xaxis= dict(title= 'Year' ),
              yaxis= dict(title= 'Total Sales In Millions',)
             )

figure = dict(data = figs, layout = layout)

iplot(figure)

#### *Average sales (in millions) per year by region*

In [None]:
# Average only the numeric sales columns per year ('Name' is text and cannot be averaged)
year_wise_sales = df[['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].groupby(by='Year').mean()


fig1 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['NA_Sales'],
                  name = "North America's Sales",
                  line_shape='vh'
                 )

fig2 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['EU_Sales'],
                  name = "Europe's Sales",
                  line_shape='vh')

fig3 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['JP_Sales'],
                  name = "Japan's Sales",
                  line_shape='vh')

fig4 = go.Scatter(x = year_wise_sales.index, y = year_wise_sales['Other_Sales'],
                  name = "Other Sales",
                  line_shape='vh')

figs = [ fig1, fig2, fig3, fig4 ]

layout = dict(title = 'Year Wise Average Sales for North America, Europe, Japan and Other Regions',
              xaxis= dict(title= 'Year' ),
              yaxis= dict(title= 'Average Sales In Millions',)
             )

figure = dict(data = figs, layout = layout)

iplot(figure)

#### *Global sales (in millions) per year by genre, with the game name on hover (scatter plot)*

In [None]:
# Scatter: one point per game, sized by its global sales

# The keys of `labels` must be column names for the axis titles to change
fig = px.scatter(df, x="Year", y="Global_Sales", color="Genre",
                 size='Global_Sales', hover_data=['Name'],
                 title="Year Wise Global Video Game Sales by Genre",
                 labels={'Year': 'Year', 'Global_Sales': 'Global Sales In Millions'}
                )

fig.show()

#### *Top ten best-selling games by genre, publisher and platform for each region (sunburst charts)*

In [None]:
# Human-readable chart titles for each regional sales column
dicts_name = {
    'NA_Sales' : "North America Sales ( In Millions)",
    'EU_Sales' : "Europe Sales ( In Millions)",
    'JP_Sales' : "Japan Sales ( In Millions)",
    'Other_Sales' : "Other Sales ( In Millions)",
}

for (key, title) in dicts_name.items():

    # Take the ten best-selling games of this particular region
    top_sales = df.nlargest(10, key)

    fig = px.sunburst(top_sales, path=['Genre', 'Publisher', 'Platform'], values=key, title= 'Top Selling by '+ title)
    
    fig.update_layout(
        grid= dict(columns=2, rows=2),
        margin = dict(t=40, l=2, r=2, b=5)
    )

    fig.show()

#### *Most frequent words in the categorical features 'Name', 'Publisher', 'Platform' and 'Genre'*

In [None]:
fig = plt.figure(figsize=(17,17))

# Common English words that should not appear in the clouds
stopwords = set(STOPWORDS)

for index, col in enumerate(categorical_features):
    
    plt.subplot(len(categorical_features), 2, index + 1)
    
    # Join every value of the column into one long string for the word cloud
    wordcloud = WordCloud(
        stopwords=stopwords
    ).generate(" ".join(df[col].astype(str)))

    # Show WordCloud Image
    plt.imshow(wordcloud)
    plt.title("Video Game " + col, fontsize=20)
    plt.axis('off')

# Adjust the spacing once, after all subplots are drawn
plt.tight_layout(pad=3)
plt.show()

#### *Correlation between the numerical features*

In [None]:
# Correlation is only defined for numeric columns, so select them first
corr_ = df.select_dtypes(include='number').corr()

plt.figure(figsize=(12, 7))

sns.heatmap(corr_, annot=True, linewidths=.2, cmap='RdYlBu_r')

plt.show()

In [None]:
df.head(5)

### Implementing LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
data = df.copy()

le = LabelEncoder()

In [None]:
feature = ["Platform", "Genre"]

for col in feature:
    data[col] = le.fit_transform(df[col])    
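
The loop above reuses a single `LabelEncoder`, so after it finishes the encoder only remembers the classes of the last column. If the original labels are needed later (for example to label a plot), one option is to keep a separate encoder per column. A minimal sketch, assuming a made-up frame `toy` and a hypothetical dictionary named `encoders`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows, only to illustrate the encoding.
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "DS"],
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
})

encoders = {}          # hypothetical name: one fitted encoder per column
encoded = toy.copy()

for col in ["Platform", "Genre"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Classes are sorted alphabetically, so DS -> 0, NES -> 1, Wii -> 2.
platform_codes = encoded["Platform"].tolist()

# The stored encoder can map the integer codes back to the labels.
decoded = encoders["Platform"].inverse_transform(encoded["Platform"]).tolist()

print(platform_codes)
print(decoded)
```

Note that label encoding imposes an arbitrary numeric order on the categories, which distance-based models such as KNN will treat as meaningful.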

Let's create the feature matrix `X` and the target `y` for the train and test split.

In [None]:
X = data[['Platform', 'Genre', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].values

y = data['Global_Sales'].values

In [None]:
X[:5], y[:5]
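
Before modelling, it may be worth checking how the target relates to the chosen features. In this dataset `Global_Sales` appears to be the sum of the four regional sales columns, up to rounding. If that holds, a very high R² from a linear model reflects this identity rather than real predictive power. The sketch below shows the check on two illustrative rows; the frame `toy` is an assumption, and on the real data the same lines would be run on `df`:

```python
import pandas as pd

# Illustrative rows; replace `toy` with `df` to check the real dataset.
toy = pd.DataFrame({
    "NA_Sales": [41.49, 29.08],
    "EU_Sales": [29.02, 3.58],
    "JP_Sales": [3.77, 6.81],
    "Other_Sales": [8.46, 0.77],
    "Global_Sales": [82.74, 40.24],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Row-wise sum of the regional columns.
regional_sum = toy[regions].sum(axis=1)

# Largest absolute difference between the sum and the reported global figure.
max_gap = (toy["Global_Sales"] - regional_sum).abs().max()

print(max_gap)
```

A gap of at most a few hundredths on the full dataset would suggest that the regional columns leak the target, and that features such as `Platform`, `Genre` and `Year` alone give a more honest test of the models.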

### Split the data into Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=45)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

## Model Selection

In [None]:
from sklearn.linear_model import LinearRegression

# Import r2 score for Calculation
from sklearn.metrics import r2_score

In [None]:
lr = LinearRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

r2_MultiLinear = r2_score(y_test,pred)

In [None]:
print(r2_MultiLinear)

print(lr.score(X_test, y_test))
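
Both lines print the same number because `LinearRegression.score` returns the coefficient of determination, the same quantity that `r2_score` computes. As a small worked example of the formula R² = 1 − SS_res / SS_tot, using made-up values for `y_true` and `y_pred`:

```python
import numpy as np

# Made-up targets and predictions, only to illustrate the formula.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# Residual sum of squares: squared errors of the predictions (0.25 + 0.25).
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of y_true (9 + 1 + 1 + 9).
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.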

### Implementing KNeighborsRegressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
kRange = range(1,15,1)

scores_list = []
for i in kRange:
    regressor_knn = KNeighborsRegressor(n_neighbors = i)
    
    regressor_knn.fit(X_train,y_train)
    pred = regressor_knn.predict(X_test)
    
    scores_list.append(r2_score(y_test,pred))

In [None]:
plt.plot(kRange, scores_list, linewidth=2, color='blue')
plt.xticks(kRange)

plt.xlabel('Neighbor Number')
plt.ylabel('r2_Score of KNN')
plt.show()   

In [None]:
# Training the KNN model on the training set
regressor_knn = KNeighborsRegressor(n_neighbors = 3)

regressor_knn.fit(X_train,y_train)
pred = regressor_knn.predict(X_test)

r2_knn = r2_score(y_test,pred)
print(r2_knn)
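
The loop above picks the number of neighbors by scoring each candidate on the test set, which lets the test data influence the choice. A common alternative is to choose `k` with cross-validation on the training data only and keep the test set for the final score. The sketch below is a hedged illustration on synthetic data; `X_demo`, `y_demo` and `cv_scores` are hypothetical names, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)

# Mean 5-fold cross-validated R² for every candidate k.
cv_scores = {}
for k in range(1, 15):
    model = KNeighborsRegressor(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()

# The k with the highest cross-validated score.
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the chosen `best_k` would then be evaluated once on `X_test`.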

### Implementing Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(random_state=32)

In [None]:
dtr.fit(X_train, y_train)

pred = dtr.predict(X_test)

print(r2_score(y_test, pred))

### Implementing RandomForest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state= 10)

In [None]:
rfr.fit(X_train, y_train)

pred = rfr.predict(X_test)

print(r2_score(y_test, pred))

### Implementing SVM

In [None]:
from sklearn.svm import SVR

svr_linear = SVR(kernel='linear')

svr_rbf = SVR(kernel='rbf')

In [None]:
svr_linear.fit(X_train, y_train)
svr_rbf.fit(X_train, y_train)

pred_linear = svr_linear.predict(X_test)
pred_rbf = svr_rbf.predict(X_test)

print(r2_score(y_test, pred_linear))
print(r2_score(y_test, pred_rbf))

### Implementing XGBoost

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()

In [None]:
xgb.fit(X_train, y_train)

pred = xgb.predict(X_test)

print(r2_score(y_test, pred))

## Applying Hyperparameter Tuning 

In [None]:
# DecisionTree Tuning
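
The cell above is left empty in the notebook. One possible way to tune the decision tree is a grid search with cross-validation; the following is only a sketch under stated assumptions, not the author's method. The data (`X_demo`, `y_demo`) is synthetic, and the parameter grid is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.3, size=300)

# Illustrative grid: tree depth and minimum samples per leaf.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validated search, scored by R² like the rest of the notebook.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=32),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting on the training data, `search.best_estimator_` is a tree refit with the best parameters and can be scored on the test set in the same way as the untuned model.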

<img align="center" src="https://i.pinimg.com/originals/69/cb/61/69cb61ef329d954713fea8560892e505.gif" alt="Thanks for Visiting"
     width="100%"/>

--- 
---

<div class="text-center">
    <h1>That's it, guys!</h1>
    <h1>🙏</h1>
    
        
        I hope you liked and enjoyed this notebook and learned something interesting from it.
        
        I also learned a lot while creating it.
    
        Keep Learning,
        Regards,
        Vikas Ukani.
    
</div>

---
---

<img src="https://static.wixstatic.com/media/3592ed_5453a1ea302b4c4588413007ac4fcb93~mv2.gif" align="center" alt="Thank You" style="min-height:20%; max-height:20%" width="90%" />

