# Introduction




In [1]:
#Importing libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

#Importing dataset
df= pd.read_csv('./Data/vgsales.csv')
df.head()


The modules are imported


Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [2]:
df.shape

(16598, 11)

* Using the df.shape attribute, we see that our dataset has 16598 rows and 11 columns.
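Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly. The following is a minimal sketch on a made-up frame; the name `toy` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the vgsales data
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Global_Sales": [1.0, 2.5, 0.3],
})

# shape is an attribute, so it is read without parentheses
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```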

In [3]:
df.isnull().sum() 

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

We look for missing values using df.isnull().sum(). We observe that the column Year has 271 missing values and the column Publisher has 58 missing values.
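To judge how much the missing values matter, it can help to see them as a share of all rows as well as a raw count. A small sketch, assuming a made-up frame (`toy` and its contents are hypothetical); taking the mean of the boolean mask gives the fraction of missing cells per column.

```python
import pandas as pd

# Hypothetical toy frame with a few gaps, not the vgsales data
toy = pd.DataFrame({
    "Year": [2006.0, None, 2008.0, None],
    "Publisher": ["Nintendo", "Nintendo", None, "Sega"],
    "Global_Sales": [82.74, 40.24, 35.82, 33.0],
})

missing_count = toy.isnull().sum()         # missing cells per column
missing_share = toy.isnull().mean() * 100  # same, as a percentage of rows

print(missing_count)
print(missing_share.round(1))
```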

# 1. Which region has performed the best in terms of sales?
We will compare the average sales per region. The average of a column is computed with df['Region'].mean(), where 'Region' stands for one of the regional sales columns in the dataset (for example NA_Sales) and mean() returns the mean of that column. The sales columns are recorded in millions, so the raw averages come out as small decimals; to express them as absolute figures, we multiply each result by 1,000,000. The final code looks like this.

In [4]:
x=(df['NA_Sales'].mean()*1000000)
y=(df['EU_Sales'].mean()*1000000)
z=(df['JP_Sales'].mean()*1000000)
q=(df['Other_Sales'].mean()*1000000)
p=(df['Global_Sales'].mean()*1000000)

print("The average sales in North America =", (f"${x:,.3f}")) #comma separated values till 3 decimal place and $ sign
print("The average sales in Europe =",(f"${y:,.3f}"))
print("The average sales in Japan =",(f"${z:,.3f}"))
print("The average sales in other regions =",(f"${q:,.3f}"))
print("The average sales globally =",(f"${p:,.3f}"))

The average sales in North America = $264,667.430
The average sales in Europe = $146,652.006
The average sales in Japan = $77,781.660
The average sales in other regions = $48,063.020
The average sales globally = $537,440.656
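The per-column calls above can also be written as a single `mean()` over several columns at once. The sketch below is one possible variant, not the notebook's own code: the frame `toy` is hypothetical and holds only the first three rows shown by df.head(), so its averages differ from the full-dataset figures.

```python
import pandas as pd

# First three rows from df.head() above (millions); a stand-in for the full data
toy = pd.DataFrame({
    "NA_Sales": [41.49, 29.08, 15.85],
    "EU_Sales": [29.02, 3.58, 12.88],
    "JP_Sales": [3.77, 6.81, 3.79],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales"]

# mean() on a DataFrame returns one average per column
avg_sales = toy[region_cols].mean() * 1_000_000

for region, value in avg_sales.items():
    print(f"{region}: ${value:,.3f}")

# Label of the column with the largest average
best_region = avg_sales.idxmax()
print("Best region:", best_region)
```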


In [5]:
#One color per bar (five bars in total); crimson highlights the Global bar
colors = ['crimson', 'darkgray', 'grey', 'dimgrey', 'lightslategray']

bar1 = go.Figure(data=[go.Bar(
    y=['Global','North America', 'Europe', 'Japan',
       'Other'],
    x=[537440.656,264667.430, 146652.006, 77781.660, 48063.020],
    orientation='h',
    marker_color=colors # marker color can be a single color value or an iterable
)])
bar1.update_layout(title_text='Region with highest sales on an average')
bar1.update_xaxes(title='Average Sales')
bar1.update_yaxes(title='Regions')

We have used a horizontal bar chart to present the results. It is now clear which region has the highest average video game sales (excluding the global figure): **North America**, with average sales of **$264,667.430**. A bar chart like this answers questions such as which region sells the most video games, and it will help us make strategy changes.

The above result also answers our 8th question, i.e., **Is there any region that has out-performed global average sales?**

The answer is **'No'**. No region has out-performed the global average sales, which stand at **$537,440.656**.
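This outcome is expected if Global_Sales is (approximately) the sum of the four regional columns, which is an assumption here rather than something the notebook verifies. A quick check using only the averages printed above:

```python
# Averages as printed by the notebook above
regional_avg = {
    "North America": 264_667.430,
    "Europe": 146_652.006,
    "Japan": 77_781.660,
    "Other": 48_063.020,
}
global_avg = 537_440.656

# Regions whose average exceeds the global average
outperformers = [name for name, value in regional_avg.items() if value > global_avg]

# If global sales are the sum of regional sales, these two should nearly match
regional_total = sum(regional_avg.values())
gap = global_avg - regional_total

print("Regions above the global average:", outperformers)
print(f"Sum of regional averages: {regional_total:,.3f}")
print(f"Gap to global average:    {gap:,.3f}")
```

The small remaining gap is consistent with rounding in the source figures, although the notebook does not confirm the cause.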

# 2. The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor.



The world is getting more connected every day, and more gaming trends and preferences are emerging around the globe. 


We will use the df.groupby function of pandas for our analysis.

**groupby(): This function is used to group or combine large amounts of data and compute operations on these groups.**
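A minimal sketch of the idea on a made-up frame (`toy` is hypothetical; the sales values are borrowed from df.head() above): rows sharing the same Platform are collected into one group, and sum() is then applied to each group.

```python
import pandas as pd

# Hypothetical toy frame: two Wii rows, one NES row, one GB row
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB"],
    "NA_Sales": [41.49, 29.08, 15.85, 11.27],
})

platform_totals = (
    toy.groupby("Platform")["NA_Sales"]  # split rows by platform
       .sum()                            # add up sales within each group
       .sort_values(ascending=False)     # largest total first
)
print(platform_totals)
```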

* The dataframe **data** created below holds the total sales per platform for each region (NA, EU, JP, Other), with each region sorted in descending order. The dataframe will look like this:

In [6]:
#Grouping the north america sales based on each platform
data2 = pd.DataFrame(df.groupby("Platform")[["NA_Sales"]].sum().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
data2.rename(columns = {'Platform':'Platform_NA'}, inplace = True)

#Grouping the europe sales based on each platform
data3 = pd.DataFrame(df.groupby("Platform")[["EU_Sales"]].sum().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
data3.rename(columns = {'Platform':'Platform_EU'}, inplace = True)

#Grouping the japan sales based on each platform
data4 = pd.DataFrame(df.groupby("Platform")[["JP_Sales"]].sum().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
data4.rename(columns = {'Platform':'Platform_JP'}, inplace = True)

#Grouping the other region sales based on each platform
data5 = pd.DataFrame(df.groupby("Platform")[["Other_Sales"]].sum().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
data5.rename(columns = {'Platform':'Platform_other'}, inplace = True)

#Concatenating our datasets
data=pd.concat([data2,data3,data4,data5],axis=1)
data.head(3)

Unnamed: 0,Platform_NA,NA_Sales,Platform_EU,EU_Sales,Platform_JP,JP_Sales,Platform_other,Other_Sales
0,X360,601.05,PS3,343.71,DS,175.57,PS2,193.44
1,PS2,583.84,PS2,339.29,PS,139.82,PS3,141.93
2,Wii,507.71,X360,280.58,PS2,139.2,X360,85.54
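The four near-identical blocks above could be collapsed into one helper applied per region. This is only a sketch of that refactoring under stated assumptions: the helper name `top_platforms`, the frame `toy`, and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical toy frame, not the vgsales data
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB", "NES"],
    "NA_Sales": [41.49, 29.08, 15.85, 11.27, 1.0],
    "JP_Sales": [3.77, 6.81, 3.79, 10.22, 0.5],
})

def top_platforms(frame, sales_col, suffix):
    """Total sales_col per platform, largest first, with a region-specific name column."""
    return (
        frame.groupby("Platform")[[sales_col]]
             .sum()
             .sort_values(by=sales_col, ascending=False)
             .reset_index()
             .rename(columns={"Platform": f"Platform_{suffix}"})
    )

# Map each sales column to the suffix used in its platform column
regions = {"NA_Sales": "NA", "JP_Sales": "JP"}

# Side-by-side rankings, one pair of columns per region
ranked = pd.concat(
    [top_platforms(toy, col, suffix) for col, suffix in regions.items()],
    axis=1,
)
print(ranked)
```

Each pair of columns is sorted independently, so a row of `ranked` reads as "the n-th best platform in each region", not as data about a single platform.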


The dataframe alone makes the result hard to analyze. Therefore, we plot it using bar charts, one subplot per region.

In [7]:
from plotly.subplots import make_subplots #import new library

subplot1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top platforms","Europe top platforms","Japan top platforms","Other regions top platforms"))

#Subplot for North America
subplot1.add_trace(go.Bar(x=data['Platform_NA'], y=data['NA_Sales'],
                    marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot1.add_trace(go.Bar(x=data['Platform_EU'], y=data['EU_Sales'],
                    marker=dict(color=[4, 5, 6], coloraxis="coloraxis")),
                    2, 1)
                   
#Subplot for Japan
subplot1.add_trace(go.Bar(x=data['Platform_JP'], y=data['JP_Sales'],
                    marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),
                    3, 1)

#Subplot for Other Regions
subplot1.add_trace(go.Bar(x=data['Platform_other'], y=data['Other_Sales'],
                    marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),
                   4, 1)
                   
subplot1.update_layout(height=900,width=500,coloraxis=dict(colorscale='Magenta'), showlegend=False)
subplot1.show()

The graph shows us the top platforms preferred by users in different regions. We observe the following:

* **X360** (Microsoft) is the top console in North America, with total sales of $601.05 million.
* **PS3** (Sony) is the top console in Europe, with total sales of $343.71 million.
* **DS** (Nintendo) is the top console in Japan, with total sales of $175.57 million.
* **PS2** (Sony) is the top console in other regions, with total sales of $193.44 million.

* This implies that our assumption about Sony, Nintendo and Microsoft among the top consoles was correct.



# 3. What are the top 10 games currently making the most sales globally?
* We will use a similar approach by grouping the games with respect to the global sales and observe the top 10 games.

In [46]:
top = pd.DataFrame(df.groupby("Name")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index())
top.head(10) #Printing the top 10 results

Unnamed: 0,Name,Global_Sales
0,Wii Sports,82.74
1,Grand Theft Auto V,55.92
2,Super Mario Bros.,45.31
3,Tetris,35.84
4,Mario Kart Wii,35.82
5,Wii Sports Resort,33.0
6,Pokemon Red/Pokemon Blue,31.37
7,Call of Duty: Black Ops,31.03
8,Call of Duty: Modern Warfare 3,30.83
9,New Super Mario Bros.,30.01


We see that the best-selling game is **Wii Sports**, with total sales of **$82.74 million** globally.


We will plot the above using a pie chart.

In [47]:
#Pass only the top 10 rows so the data frame and the plotted columns have the same length
pie1 = px.pie(top.head(10), values='Global_Sales', names='Name', title='Top 10 games globally',
              color_discrete_sequence=px.colors.sequential.Purp_r)
pie1.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)

pie1.show()

Besides ranking the games, the pie chart shows the proportion of the top 10 sales that each game holds.
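The percentages on the pie are shares of the ten plotted values only, not of all global sales. A short sketch that reproduces them by hand from the table above (the variable names are illustrative):

```python
# Global sales (millions) of the top 10 games, copied from the table above
top10 = {
    "Wii Sports": 82.74,
    "Grand Theft Auto V": 55.92,
    "Super Mario Bros.": 45.31,
    "Tetris": 35.84,
    "Mario Kart Wii": 35.82,
    "Wii Sports Resort": 33.0,
    "Pokemon Red/Pokemon Blue": 31.37,
    "Call of Duty: Black Ops": 31.03,
    "Call of Duty: Modern Warfare 3": 30.83,
    "New Super Mario Bros.": 30.01,
}

# Each slice is the game's sales divided by the sum of the plotted values
total = sum(top10.values())
shares = {name: 100 * sales / total for name, sales in top10.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```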

# 4. What are the top games for different regions?

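* We will compare the sales made by each game region by region. Note that the code below uses mean(), so a game that appears in several rows of the dataset (typically one row per platform) is ranked by its average sales per row rather than by its combined total.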

In [48]:
name2 = pd.DataFrame(df.groupby("Name")[["NA_Sales"]].mean().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
name2.rename(columns = {'Name':'Name_NA'}, inplace = True)

name3 = pd.DataFrame(df.groupby("Name")[["EU_Sales"]].mean().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
name3.rename(columns = {'Name':'Name_EU'}, inplace = True)

name4 = pd.DataFrame(df.groupby("Name")[["JP_Sales"]].mean().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
name4.rename(columns = {'Name':'Name_JP'}, inplace = True)

name5 = pd.DataFrame(df.groupby("Name")[["Other_Sales"]].mean().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
name5.rename(columns = {'Name':'Name_other'}, inplace = True)

#Concatenating the results.
name_df=pd.concat([name2,name3,name4,name5],axis=1)

In [49]:
subplot_name1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top games","Europe top games", "Japan top games","Other regions top games")) #one title per subplot

#Subplot for North America
subplot_name1.add_trace(go.Bar(x=name_df['Name_NA'][:5], y=name_df['NA_Sales'][:5],marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot_name1.add_trace(go.Bar(x=name_df['Name_EU'][:5], y=name_df['EU_Sales'][:5],marker=dict(color=[4, 5, 6], coloraxis="coloraxis")), 2, 1)

#Subplot for Japan
subplot_name1.add_trace(go.Bar(x=name_df['Name_JP'][:5], y=name_df['JP_Sales'][:5],marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),3, 1)

#Subplot for other regions
subplot_name1.add_trace(go.Bar(x=name_df['Name_other'][:5], y=name_df['Other_Sales'][:5],marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),4, 1)

subplot_name1.update_layout(height=1000,width=500,coloraxis=dict(colorscale='Mint_r'), showlegend=False)
subplot_name1.update_xaxes(tickangle=45)
subplot_name1.show()

The graph shows us the top games in each region. We observe the following:

* **Wii Sports** is the top game in North America, Europe, and other regions.
* **Pokemon Red/Pokemon Blue** is the top game in Japan.
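The top game per region can also be read off without a plot by asking for the row label of the maximum in each sales column. A minimal sketch, assuming a hypothetical frame `toy` built from three of the rows shown by df.head() at the start of the notebook:

```python
import pandas as pd

# Three rows from df.head() above (millions); a stand-in for the full data
toy = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Pokemon Red/Pokemon Blue"],
    "NA_Sales": [41.49, 29.08, 11.27],
    "EU_Sales": [29.02, 3.58, 8.89],
    "JP_Sales": [3.77, 6.81, 10.22],
    "Other_Sales": [8.46, 0.77, 1.0],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Average sales per game, with Name as the index
per_game = toy.groupby("Name")[region_cols].mean()

# idxmax() returns, for each column, the index label of its largest value
top_game = per_game.idxmax()
print(top_game)
```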