## 1.Data Understanding

Let's start by importing the pandas library.

In [None]:
import pandas as pd

In [None]:
path_original="/kaggle/input/videogamesales/vgsales.csv" 

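The cell above defines the path of the csv data file. Next we create a dataframe (**_df_**) with **_header=None_**, so the header line of the file is read as an ordinary first row instead of being used for the column names.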

In [None]:
df=pd.read_csv(path_original, header = None)

Let's have a quick look at the first 5 rows of the data by calling the **_.head()_** function.

In [None]:
df.head()

As we can see, our dataframe now has numeric indices for both rows and columns, and the real column names sit in the first row. We will tidy up the dataframe a little in the coming sections. Now let's have a look at the datatypes of the columns by calling the **_.info()_** function.

In [None]:
df.info()

As we can see, every column has the datatype **_object_**, which basically means string. We will change that, because some columns need to be **_integer_** or **_float_** before we can plot them. We will come back to that later. Let's first fix the column names and reset the row indices.

In [None]:
df.columns = df.iloc[0,]
df.drop([0],axis=0,inplace=True)
df.reset_index(inplace=True, drop=True)
df
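As an alternative sketch, **_pd.read_csv_** uses the first line of the file as the header by default, so the column names (and numeric datatypes) can be obtained in a single step. The small in-memory CSV below is made-up sample data standing in for the real file, and the name `df_alt` is only for illustration.

```python
import io
import pandas as pd

# Made-up rows that mimic the layout of the data file (illustration only).
sample_csv = io.StringIO(
    "Rank,Name,Platform,Year,Genre,Publisher,Global_Sales\n"
    "1,Game A,Wii,2006,Sports,Pub X,82.5\n"
    "2,Game B,NES,N/A,Platform,Pub X,40.25\n"
)

# header=0 is the default: the first line supplies the column names,
# and pandas infers numeric dtypes for the remaining rows.
df_alt = pd.read_csv(sample_csv)

print(df_alt.columns.tolist())
print(df_alt.dtypes)
```

With this approach the "N/A" entry in 'Year' is read as a missing value, so that column comes out as float until the missing rows are dropped.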

Let's check the modified dataframe again. It now looks tidier.

In [None]:
df.head()

Let's use another dataframe tool, **_.describe()_**, to get more details.

In [None]:
df.describe(include="all")

Normally the **_.describe()_** function gives the max, min and mean values of the columns, together with some other numeric statistics. We cannot see those details until we change the datatypes of the columns, which we will do in the coming sections. At this point it is a good idea to save the dataframe to another csv file to keep our work safe.

## 2.Data Wrangling

After gaining a basic understanding of the dataset, let's start shaping it by checking for null values.

In [None]:
missing_val=df.isnull()
missing_val.value_counts()
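A more compact way to read the same information is to count the missing cells per column. The sketch below uses a made-up toy dataframe (`toy`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data (values are made up).
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": ["2006", np.nan, "2008"],
    "Publisher": ["Pub X", "Pub Y", np.nan],
})

# isnull() marks missing cells; summing the booleans counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Share of rows that have at least one missing value.
share_missing = toy.isnull().any(axis=1).mean()
print(round(share_missing, 2))
```

The per-column counts show directly which columns are affected, and the row share helps judge whether dropping those rows is acceptable.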

The 'Year' and 'Publisher' columns have null values. We can't produce the graphs we want with empty 'Year' and 'Publisher' information, and we can't calculate a mean value or any other substitute for such data. Since the missing values are also only a small percentage of the rows, we will drop those rows.

In [None]:
df.dropna(subset=['Year'], axis=0, inplace = True)  #Deleting rows where 'Year' data is null.

In [None]:
df.dropna(subset=['Publisher'], axis=0, inplace=True) #Deleting rows where 'Publisher' is null.

Let's reset the index.

In [None]:
df.reset_index(drop=True, inplace=True)

Let's change the datatypes as required. We need some of the columns, such as the sales numbers, to be float.

In [None]:
df.describe()

In [None]:
# Cast all numeric columns in one call: Rank and Year to int, sales figures to float.
df = df.astype({"Rank": "int", "Year": "int",
                "NA_Sales": "float", "EU_Sales": "float", "JP_Sales": "float",
                "Other_Sales": "float", "Global_Sales": "float"})
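If a column might contain a stray non-numeric string, **_.astype()_** raises an error. A more forgiving option is **_pd.to_numeric_** with `errors="coerce"`; the sketch below uses a made-up series and is not part of the original workflow.

```python
import pandas as pd

# Toy string column with one entry that is not a number (made-up values).
raw = pd.Series(["41.5", "29.25", "unknown"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
sales = pd.to_numeric(raw, errors="coerce")

print(sales.dtype)
print(sales.isna().sum())
```

The coerced NaN values can then be inspected or dropped like any other missing data.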

Let's check the new datatypes and use the **_.describe()_** function to get additional information about the numerical columns.

In [None]:
df.dtypes

Now that we have corrected the datatypes, we can get more detailed information about the int and float columns. Notice that we now have numerical values like mean, min and max.

In [None]:
df.describe(include="all")

Let's have a quick look at the genres before we start creating graphs. We will also look at the first 5 rows with the **_.head()_** function.

In [None]:
df['Genre'].unique()

In [None]:
df['Genre'].value_counts().to_frame()

In [None]:
df.head()

In [None]:
df.describe(include='all')

It is good practice to save the shaped dataframe to another file once the wrangling is complete.
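A minimal sketch of such a save step is shown below. The dataframe `clean` and the output path are assumptions for illustration; on Kaggle the file could instead go under the notebook's working directory.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the wrangled dataframe (made-up values).
clean = pd.DataFrame({"Rank": [1, 2], "Global_Sales": [82.5, 40.25]})

# Hypothetical output location, chosen only so that the example runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "vgsales_clean.csv")

# index=False keeps the row index out of the file, so it reloads with the same columns.
clean.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```

Because the header is written to the file, reloading it with the default **_pd.read_csv_** settings restores the column names and numeric datatypes.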

## 3.Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
import matplotlib as mpl

mpl.style.use('ggplot')  # optional: for ggplot-like style

Let's start with a very basic analysis (graph) to see which genres are in demand in general. For this purpose, we will generate another dataframe that holds the total global sales grouped by genre.

In [None]:
df_grp_genre=df[["Genre","Global_Sales"]]
# axis=0 is the default and is deprecated as an explicit argument in recent pandas, so it is omitted.
grp_genre=df_grp_genre.groupby("Genre", as_index=True)

In [None]:
grp_genre_total=grp_genre.sum()
grp_genre_total.sort_values(by='Global_Sales',inplace=True, ascending=False)
grp_genre_total

Let's create a pie chart to visualise this.

In [None]:
#Lets create a pie plot of the same sub-dataframe. 

grp_genre_total['Global_Sales'].plot( kind='pie',
                                    figsize=(14,8),
                                    autopct = '%1.1f%%', #label format
                                    startangle=90, #the start angle of the first wedge
                                    shadow = True,
                                    labels=None,
                                    pctdistance=1.12)

plt.title('Total Game Sales by Genre during last 40 years',y=1.12)
plt.axis('equal')
plt.legend(labels=grp_genre_total.index, loc='upper left')

**Comment**: We can see from the pie chart that between 1980 and 2020 the biggest sales figures belong to Action, Sports and Shooter games, in that order.

Next, we would like to look at the total global sales of all genres to get an idea of the growth of the video game market. Let's create a sub-dataframe of global game sales and group it by year.

In [None]:
df_grp_years=df[["Year","Global_Sales"]]
grp_years_total = df_grp_years.groupby("Year", as_index=True).sum()  # axis=0 is the default, so it is omitted
grp_years_total.sort_values(by='Year', inplace=True)
grp_years_total

Notice that the sales figures for 2017 and 2020 look a little odd. With a little domain knowledge, or a quick search, we can see that these figures are probably incomplete or wrong. So let's drop them too.

In [None]:
grp_years_total.drop(axis=0,index=[2017,2020], inplace=True)

In [None]:
grp_years_total
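One way to back up a decision like this is to count how many titles the dataset lists per year, since years with only a handful of entries are probably incomplete. The sketch below uses made-up rows, and the threshold of 3 titles is an arbitrary assumption.

```python
import pandas as pd

# Made-up rows: several titles in 2015, very few in 2017 and 2020.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015, 2017, 2020],
    "Global_Sales": [1.2, 0.8, 2.1, 0.5, 0.03, 0.29],
})

# Number of titles and total sales per year, side by side.
per_year = toy.groupby("Year")["Global_Sales"].agg(["count", "sum"])
print(per_year)

# Hypothetical threshold: treat years with fewer than 3 titles as incomplete.
sparse_years = per_year.index[per_year["count"] < 3].tolist()
print(sparse_years)
```

The resulting list could then be passed to **_.drop()_** instead of typing the years by hand.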

Let's create an area plot of global sales by year.

In [None]:
grp_years_total.plot(kind='area',
                    figsize=(14,8),
                    alpha=0.4)

**Comment:** As you can see from the figure, global sales of video games peaked in 2008 and 2009. Let's see which games were released in those years.

In [None]:
df_2008_filter=df['Year']==2008
df_2009_filter=df['Year']==2009
df_2008_2009 = df[df_2008_filter|df_2009_filter].sort_values(by='Global_Sales', ascending=False)
df_2008_2009.head()

**Comment:** As you can see, the Wii's dominance and the combination of sports games with a classic franchise like "Mario" made the market peak in those years.

Let's look in more detail at the sales of genres over time. For ease of reading, we will only consider the top four genres: Action, Sports, Shooter and Role-Playing.

In [None]:
genre_group = ['Action','Sports','Shooter','Role-Playing']
df_grp_years_genre_filter = df['Genre'].isin(genre_group)

In [None]:
df_grp_years_genre = df[df_grp_years_genre_filter]
df_grp_years_genre = df_grp_years_genre[['Year','Genre','Global_Sales']]

In [None]:
#df_grp_years_genre.sort_values(by='Year', inplace=True)

In [None]:
df_grp_years_genre.describe()

In [None]:
filter1=df_grp_years_genre['Genre']=='Sports'
filter2=df_grp_years_genre['Year']==1988
df_grp_years_genre['Global_Sales'][(filter1) & (filter2)].sum()  #Filter test

Let's create a new dataframe that has the genres as columns and the global sales by year as values.

In [None]:
genre_group = ['Action','Sports','Shooter','Role-Playing']
years_serie=np.arange(1980,2018,1)
genre_group

In [None]:
#Here I create a new dataframe that has the genre names as columns. This is for plotting purposes. Remember that our original dataset doesn't have separate columns for each genre.
dfx = pd.DataFrame({'Year': years_serie})
for genre in genre_group:
    dfx[genre]=0.0  # float zeros, because the sales totals filled in later are floats
dfx.tail()

In [None]:
#Assigning Year column as index column.
dfx.index=dfx['Year']
dfx.drop('Year', axis=1,inplace=True)

In [None]:
#Here I fill the new dataframe with values from the previous dataset.
for x in years_serie:
    for genre in genre_group:
        filter1=df_grp_years_genre['Genre']==genre
        filter2=df_grp_years_genre['Year']==x
        total_val=df_grp_years_genre['Global_Sales'][(filter1) & (filter2)].sum()
        dfx.loc[x,genre] = total_val
dfx.head()
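The nested loop above can also be expressed with **_.pivot_table()_**, which sums the sales for every year and genre combination in one call. This is only a sketch on made-up rows; `toy`, `genres`, `years` and `wide` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up long-format rows standing in for the filtered year/genre/sales data.
toy = pd.DataFrame({
    "Year": [1980, 1980, 1981, 1981, 1981],
    "Genre": ["Action", "Sports", "Action", "Action", "Shooter"],
    "Global_Sales": [0.5, 0.25, 1.0, 2.0, 0.75],
})
genres = ["Action", "Sports", "Shooter", "Role-Playing"]
years = np.arange(1980, 1983)

# Sum sales per (Year, Genre), then make sure every year and genre is present.
wide = (
    toy.pivot_table(index="Year", columns="Genre", values="Global_Sales",
                    aggfunc="sum", fill_value=0)
       .reindex(index=years, columns=genres, fill_value=0)
)
print(wide)
```

The **_.reindex()_** step plays the role of the pre-filled zero dataframe: years or genres without any sales still appear, with a value of 0.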

Let's create an area plot to visually examine the sales of different genres by year.

In [None]:
dfx.plot(kind='area',
             stacked=False,
             alpha=0.4,
             figsize=(20, 10),
            title='Console Game Global Sales by Genre') 

**Comment**: We can see that the sales of these top genres rise roughly in parallel. Action and Sports games lead the market, followed by Shooter and Role-Playing games. Role-playing games seem to have a more consistent fan base, because their sales numbers (_starting from 2000_) don't change dramatically and grow at a stable rate. Sports games, by contrast, look more like a "short term trend": they peaked in 2005 and 2010 but dropped by roughly 70% (_from about 140M to 40M_) between 2010 and 2012.

Another interesting point is the start of the rally in 1995. This most likely reflects the first release of the Sony PlayStation (_end of 1994_) and the Nintendo 64 (_mid 1996_). The video game market started to change in those years and was never the same again. Note the radical change in the industry, with the total market value rising from 80M to 680M (_figure 1_).

**Option**: In addition to this area plot, a similar graph can be plotted for the platforms' performance over the years. Instead, we will take a closer look at the PSP (end date: December 2014) and the PSV (end date: April 27, 2021) and try to understand the numbers behind the decision to end their lifetimes.

Now let's create a heatmap of platform vs. genre, with total global sales as the values. For this purpose we will create another dataframe (_grp_pivot_).

In [None]:
df_grp_console_genre = df[['Platform','Genre','Global_Sales']]
df_grp_console_genre_total = df_grp_console_genre.groupby(['Platform','Genre'], as_index = False).sum()
grp_pivot = df_grp_console_genre_total.pivot(index='Platform', columns='Genre')
grp_pivot.fillna(0, inplace=True) #fills not available values with 0
grp_pivot.head()
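Note that calling **_.pivot()_** without a `values` argument produces two-level column labels. As a sketch of an alternative, passing `values` gives plain genre columns that can be read directly from `.columns`. The rows and the name `flat_pivot` below are made up for illustration.

```python
import pandas as pd

# Made-up totals per platform and genre.
toy = pd.DataFrame({
    "Platform": ["PS3", "PS3", "Wii"],
    "Genre": ["Action", "Sports", "Sports"],
    "Global_Sales": [3.0, 1.5, 4.0],
})

# values= selects a single column, so the result has plain (non-MultiIndex) columns.
flat_pivot = toy.pivot(index="Platform", columns="Genre", values="Global_Sales").fillna(0)

print(flat_pivot)
print(list(flat_pivot.columns))
```

With flat columns the genre labels would come from `flat_pivot.columns` rather than from `columns.levels[1]`.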

Let's create the heatmap.

In [None]:
fig, ax = plt.subplots()
im = ax.pcolor(grp_pivot, cmap='RdBu')
fig.set_size_inches(15,15)
#label names

row_labels = grp_pivot.columns.levels[1]
col_labels = grp_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grp_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grp_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)
ax.set_title("Total number of game sales between 1980-2017, (M USD)")

# Loop over data dimensions and create text annotations.
for i in range(len(grp_pivot.index)):
    for j in range(len(grp_pivot.columns)):
        text = ax.text(j, i, int(grp_pivot.iloc[i,j]),
                        va="bottom",
                        ha="left",
                        color="white",
                        size=15)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

**Comment: Which platforms are better in certain genres?** Now we have a clearer picture: while PS, Xbox 360 and Wii dominate the best-selling genres such as action, sports and shooter, the DS seems to perform at an average level across more genres, such as action, misc, platform, puzzle, role-playing and simulation (_which seems like a reasonable market strategy_). We can also see the underperforming (_or non-existent_) platforms such as 3DO, GG, NG, PCFX and WS. The **best**-selling platform-genre combination is **PS3 & Action**, with sales of **304M USD**.

Now I would like to go into more detail on Sony's game consoles. As a handheld console fan, I always wondered why Sony's handheld consoles came to an end. To find some clues, we will create a new dataframe and draw another area plot.

In [None]:
ps_platform_group = ['PS','PS2','PS3','PS4','PSP','PSV']
years_serie=np.arange(1980,2018,1)

df_sony = pd.DataFrame({'Year': years_serie})
for platform in ps_platform_group:
    df_sony[platform]=0.0  # float zeros, because the sales totals filled in later are floats
df_sony.set_index('Year', inplace=True)
df_sony.tail() #The first PlayStation only arrived in the mid-1990s, so df_sony.head() would not show any sales figures.

In [None]:
#Lets fill in Global_Sales numbers in our dataframe
for x in years_serie:
    for platform in ps_platform_group:
        filter1=df['Platform']==platform
        filter2=df['Year']==x
        total_val=df['Global_Sales'][(filter1) & (filter2)].sum()
        df_sony.loc[x,platform] = total_val
df_sony.tail() #The head of the dataframe only shows zeros, because the first Sony PS was not released until the mid-1990s.

In [None]:
df_sony.plot(kind='area',
             stacked=False,
             alpha=0.8,
             figsize=(20, 10),
            title="Global Sales Figures of Sony Game Consoles") 

**Comment**: Our visual makes it clear why the PSP and its heir, the PSV, didn't last. Compared to the success of Sony's TV consoles, its handheld consoles only reached about 25% at their peak, and even less in terms of total sales over the years. In my opinion, these numbers, and data analyses such as this one, should have given Sony (and its competitors) an indication of whether or not to keep pushing the handheld console market, especially after 2015.
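A comparison like this can also be put into numbers. The sketch below assumes the percentage means the handheld's best year relative to the best year of a TV console, and it uses made-up yearly totals rather than the real figures.

```python
import pandas as pd

# Made-up yearly totals for one TV console and one handheld (not the real figures).
toy = pd.DataFrame(
    {"PS2": [100.0, 200.0, 150.0], "PSP": [10.0, 50.0, 40.0]},
    index=[2004, 2005, 2006],
)

# Best single year of each platform, and the handheld's peak relative to the TV console's.
peaks = toy.max()
peak_ratio = peaks["PSP"] / peaks["PS2"]

# The same comparison on totals over all years.
total_ratio = toy["PSP"].sum() / toy["PS2"].sum()

print(f"peak ratio: {peak_ratio:.0%}, total ratio: {total_ratio:.0%}")
```

Applied to the real per-platform dataframe, the same two ratios would show how far the handhelds trail the TV consoles at their peak and overall.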

**Geographical Numbers**: In our last graph, we will look at the effect of geography on the sales of specific genres and platforms. We will create a joined graph. First let's create our dataframes.

In [None]:
df_pg=df[['Platform','NA_Sales','EU_Sales','JP_Sales','Global_Sales']].groupby('Platform').sum() #pg stands for platform, geography
df_pg.sort_values('Global_Sales', ascending=False, inplace=True)
df_pg.head()

In [None]:
df_gg=df[['Genre','NA_Sales','EU_Sales','JP_Sales','Global_Sales']].groupby('Genre').sum() #gg stands for genre, geography
df_gg.sort_values('Global_Sales', ascending=False, inplace=True)
df_gg.head()

For an easier view, we will use only the first 5 rows, ordered by global sales.

In [None]:
fig = plt.figure() # create figure

ax0 = fig.add_subplot(1, 2, 1) # add subplot 1 (1 row, 2 columns, first plot)
ax1 = fig.add_subplot(1, 2, 2) # add subplot 2 (1 row, 2 columns, second plot)

# Subplot 1: Bar plot
df_pg.head().plot(kind='bar',  figsize=(20, 6), ax=ax0) # add to subplot 1
ax0.set_title('Total Number of Game Sales by Platform & Geography')
ax0.set_xlabel('Console')
ax0.set_ylabel('M USD')

# Subplot 2: Bar plot
df_gg.head().plot(kind='bar', figsize=(20, 6), ax=ax1) # add to subplot 2
ax1.set_title ('Total Number of Game Sales by Genre & Geography')
ax1.set_xlabel('Genre')
ax1.set_ylabel('M USD')

plt.show()

**Comment on Consoles' Sales**: The first thing to notice in the left plot is that North America is the biggest market for consoles, while Japan is the smallest. While the Xbox 360 leads the North American market, Sony's PS2 and PS3 are dominant in Europe. Japan seems to have had more fans of handheld consoles than of TV consoles.

**Comment on Genres' Sales**: It is interesting to notice, in the right-hand plot, that almost all genres' sales figures move in parallel across regions. Role-playing games in Japan tell a different story, though: while in other regions almost 50% of sales come from action games, in Japan almost 40% come from role-playing games.
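Percentages like these can be computed by dividing each region's column by its own total. The sketch below uses made-up regional totals for three genres, so the printed shares are not the real ones.

```python
import pandas as pd

# Made-up regional totals for three genres (not the real figures).
toy = pd.DataFrame(
    {"NA_Sales": [50.0, 30.0, 20.0], "JP_Sales": [10.0, 10.0, 20.0]},
    index=["Action", "Sports", "Role-Playing"],
)

# Divide each column by its own total to get every genre's share within a region.
shares = toy.div(toy.sum(axis=0), axis=1) * 100
print(shares.round(1))
```

Each column of `shares` adds up to 100, so the genre mix of different regions can be compared directly even when the market sizes differ.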

**Option**: We can also use heatmaps for a similar analysis.

_Please note that the sole purpose of this study is to practice the Pandas and NumPy libraries, and that it only represents my own ideas as a video game enthusiast. It does not include any kind of actual inside information from any of the companies mentioned above. The comments are fully spontaneous._