# Project Description

Historical game sales data, user and expert ratings, genres and platforms (such as Xbox or PlayStation) are available from public sources. The task is to identify the patterns that determine whether a game succeeds. This makes it possible to bet on a potentially popular product and plan advertising campaigns. <br />
<br />
The data run up to 2016. Suppose it is December 2016 and we are planning a campaign for 2017. The goal is to work out a general approach to working with the data, so it does not matter whether sales are forecast for 2017 from 2016 data or for 2027 from 2026 data. <br />
<br />
The data set uses the abbreviation ESRB (Entertainment Software Rating Board), an association that determines the age rating of computer games. The ESRB evaluates game content and assigns it an appropriate age category, such as Mature, Early Childhood, or Teen.

# Data Description

- Name - the name of the game
- Platform - platform
- Year_of_Release - year of release
- Genre - game genre
- NA_sales - sales in North America (millions of copies sold)
- EU_sales - sales in Europe (millions of copies sold)
- JP_sales - sales in Japan (millions of copies sold)
- Other_sales - sales in other countries (millions of copies sold)
- Critic_Score - Critics score (maximum 100)
- User_Score - user rating (maximum 10)
- Rating - rating from the ESRB (Entertainment Software Rating Board). This association determines the rating of computer games and assigns them an appropriate age category.
<br />
<br />
Data for 2016 may not be complete.

# Import data files, study general information

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy import stats as st
from IPython.display import display


pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

In [None]:
# Let's define a function for printing a dataset and information about it

def print_df(table):
    
    display(table)
    print()
    table.info()

In [None]:
#df = pd.read_csv('C:/Users/KDG/Google Drive/data science/5 Сборный проект 1/games.csv', sep=',')
df = pd.read_csv('/datasets/games.csv', sep=',')
print_df(df)

# Data preparation

In [None]:
# Convert the column names to lowercase

df.columns = df.columns.str.lower()
df.info()

## Change column data types to more appropriate ones and fill in gaps

### name

In [None]:
print(df['name'].value_counts().sort_index(ascending=True).head(1000))
print(df['name'].sort_values(ascending=False).unique())

In [None]:
print_df(df[df['name'].isna() == True])

In [None]:
# These rows are almost completely empty, so no information can be extracted from them. Let's delete them

index = df[df['name'].isna() == True].index
df = df.drop(index).reset_index(drop=True)
df.info()

### platform

In [None]:
print(df['platform'].value_counts().sort_index(ascending=True).head(1000))
print(df['platform'].sort_values(ascending=False).unique())
df.info()

### year_of_release

In [None]:
print(df['year_of_release'].value_counts().sort_index(ascending=True).head(1000))
print(df['year_of_release'].sort_values(ascending=False).unique())

In [None]:
print_df(df[df['year_of_release'].isna() == True])

In [None]:
# Some release years could be recovered from the title or from other sources. However, only 269 rows are affected,
# which is less than 2% of the entire set, so they can be removed

index = df[df['year_of_release'].isna() == True].index
df = df.drop(index).reset_index(drop=True)
df.info()
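The decision to drop rows rests on the share of missing values, so it may help to compute that share explicitly instead of reading it off `info()`. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; only the column names follow the notebook.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'year_of_release': [2001.0, np.nan, 2003.0, 2004.0],
    'critic_score': [np.nan, np.nan, 70.0, 80.0],
})

# isna() gives a boolean frame; the mean of booleans is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)

print((missing_share * 100).round(1))
```

Applied to `df`, the same two lines would give the percentage of gaps per column in one table.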

In [None]:
# Cast the values to `int64`

df['year_of_release'] = df['year_of_release'].astype('int64')
df.info()

### genre

In [None]:
print(df['genre'].value_counts().sort_index(ascending=True).head(1000))
print(df['genre'].sort_values(ascending=False).unique())
df.info()

### na_sales

In [None]:
print(df['na_sales'].value_counts().sort_index(ascending=True).head(1000))
print(df['na_sales'].sort_values(ascending=False).unique())
df.info()

### eu_sales

In [None]:
print(df['eu_sales'].value_counts().sort_index(ascending=True).head(1000))
print(df['eu_sales'].sort_values(ascending=False).unique())
df.info()

### jp_sales

In [None]:
print(df['jp_sales'].value_counts().sort_index(ascending=True).head(1000))
print(df['jp_sales'].sort_values(ascending=False).unique())
df.info()

### other_sales

In [None]:
print(df['other_sales'].value_counts().sort_index(ascending=True).head(1000))
print(df['other_sales'].sort_values(ascending=False).unique())
df.info()

### critic_score

In [None]:
print(df['critic_score'].value_counts().sort_index(ascending=True).head(1000))
print(df['critic_score'].sort_values(ascending=False).unique())

In [None]:
print_df(df[df['critic_score'].isna() == True])

In [None]:
# Fill missing critic scores with the median for the same release year and genre.
# (year, genre) groups that have no critic score at all are filled with the sentinel value "False"

for year in df['year_of_release'].unique():
    for genre in df['genre'].unique():
        
        mask = (df['genre'] == genre) & (df['year_of_release'] == year)
        
        if df.loc[mask, 'critic_score'].count() > 0:
            fill_value = df.loc[mask, 'critic_score'].median()
        else:
            fill_value = False
        
        # Fill only the 'critic_score' column, not every column of the selected rows
        df.loc[mask, 'critic_score'] = df.loc[mask, 'critic_score'].fillna(fill_value)
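The same idea, filling a gap with the median of its (year, genre) group, can also be written without explicit loops. The sketch below is an alternative, not the notebook's method: it uses `groupby(...).transform('median')` on made-up data and leaves groups without any score as NaN instead of marking them with "False".

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'year_of_release': [2010, 2010, 2010, 2011, 2011],
    'genre': ['Action', 'Action', 'Action', 'Sports', 'Sports'],
    'critic_score': [80.0, np.nan, 60.0, np.nan, np.nan],
})

# Median of each (year, genre) group, broadcast back to the original rows.
group_median = (
    toy.groupby(['year_of_release', 'genre'])['critic_score'].transform('median')
)

# Fill gaps with the group median; groups without any score stay NaN.
toy['critic_score_filled'] = toy['critic_score'].fillna(group_median)

print(toy)
```

Rows that remain NaN afterwards can then be dropped with `dropna(subset=[...])`, which avoids comparing scores with a sentinel value.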

In [None]:
# Number of "False" values

df.query('critic_score == False').count()

In [None]:
# Almost all rows marked "False" in this column lack not only a critic score but also a user score
# and an ESRB rating. There are 931 of them, less than 6% of both the initial and the current dataset, so these rows can be deleted

index = df[df['critic_score'] == False].index
df = df.drop(index).reset_index(drop=True)
df.info()

In [None]:
# Round the scores up and cast them to `int64`

df['critic_score'] = df['critic_score'].apply(np.ceil).astype('int64')
df.info()

### user_score

In [None]:
print(df['user_score'].value_counts().sort_index(ascending=True).head(1000))
print(df['user_score'].sort_values(ascending=False).unique())

In [None]:
# 'tbd' apparently stands for "to be determined", i.e. the score will be posted later

print_df(df[df['user_score'] == 'tbd'].head(1000))

In [None]:
# Replace the 'tbd' values with NaN and convert the column to numbers,
# which makes it possible to calculate the median or average by year and game genre

df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')
print(df['user_score'].sort_values(ascending=False).unique())

In [None]:
# Fill missing user scores with the median for the same release year and genre.
# (year, genre) groups that have no user score at all are filled with the sentinel value "False"

for year in df['year_of_release'].unique():
    for genre in df['genre'].unique():
        
        mask = (df['genre'] == genre) & (df['year_of_release'] == year)
        
        if df.loc[mask, 'user_score'].count() > 0:
            fill_value = df.loc[mask, 'user_score'].median()
        else:
            fill_value = False
        
        # Fill only the 'user_score' column, not every column of the selected rows
        df.loc[mask, 'user_score'] = df.loc[mask, 'user_score'].fillna(fill_value)

In [None]:
# Number of "False" values

df[df['user_score'] == False].count()

In [None]:
# There are 38 "False" values, so they can be deleted: about 93% of the initial dataset will remain

index = df[df['user_score'] == False].index
df = df.drop(index).reset_index(drop=True)
df.info()

In [None]:
# Cast the values to `float64`

df['user_score'] = df['user_score'].astype('float64')
df.info()

### rating 

In [None]:
print(df['rating'].value_counts().sort_index(ascending=True).head(1000))
print(df['rating'].sort_values(ascending=False).unique())

In [None]:
print_df(df[df['rating'].isna() == True])

In [None]:
# "NaN" values are left as they are, since in this situation they can be neither reliably filled nor deleted

## Calculation of total sales in all regions

In [None]:
df['sum_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales'] + df['other_sales']
df.info()

# Exploratory data analysis

## Calculation of the current period

In [None]:
# Number of released games by year

df.groupby('year_of_release')['year_of_release'].count()

In [None]:
# Let's build a summary table of the amount of sales by platform and by year

table = df.pivot_table(index=['platform', 'year_of_release'], values=['sum_sales'], aggfunc='sum').reset_index()

In [None]:
# We will select the top 5 platforms by the amount of sales

top_table = df.pivot_table(index=['platform'], values=['sum_sales'], aggfunc='sum').reset_index().sort_values(by = 'sum_sales', ascending=False).head(5)
top_list = top_table['platform'].tolist()

In [None]:
# Let's build charts of the top 5 platforms by year. We also calculate the lifetime of each platform

time_in_year = [] # lifetime of each platform in years

for name_platform in top_list:
    
    plot_table = table.query('platform == @name_platform')
    plot_table.plot(x='year_of_release', y='sum_sales', kind='bar', title=name_platform)
    plt.show()
    print('Sales start year:', plot_table['year_of_release'].min())
    print('Sales end year:', plot_table['year_of_release'].max())
    time_in_year.append(plot_table['year_of_release'].max() - plot_table['year_of_release'].min())
    print()
    
print('Median number of years of existence:', np.median(time_in_year))    
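The lifetime calculation above can also be done for all platforms at once with a single aggregation. A minimal sketch on made-up numbers (`toy_table` is an assumption shaped like the notebook's `table`):

```python
import pandas as pd

# Toy (platform, year, sales) table with invented numbers, shaped like `table`.
toy_table = pd.DataFrame({
    'platform': ['A', 'A', 'A', 'B', 'B'],
    'year_of_release': [2000, 2005, 2010, 2006, 2013],
    'sum_sales': [1.0, 5.0, 0.5, 2.0, 3.0],
})

# First and last year with sales for every platform.
lifetime = toy_table.groupby('platform')['year_of_release'].agg(['min', 'max'])

# Lifetime as the difference between the last and the first year, as in the loop above.
lifetime['years'] = lifetime['max'] - lifetime['min']

print(lifetime)
print('Median lifetime in years:', lifetime['years'].median())
```

Restricting `toy_table` to the top platforms first (for example with `isin(top_list)`) would reproduce the top-5 variant.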

<code style="background:black;color:violet">So the current period is **2007 to 2016** </code>

## Exploring gaming platforms

In [None]:
# Get a table with the current period

table_actual = table.query('year_of_release >= 2007')

In [None]:
# Explore platforms with charts: one line per platform on shared axes

fig, ax = plt.subplots(figsize=(15, 15))

for name_platform in table_actual['platform'].unique():  

    plot_table = table_actual.query('platform == @name_platform')
    plot_table.plot(x='year_of_release', y='sum_sales', kind='line', ax=ax, label=name_platform, grid=True, xlim=(2007, 2016))

plt.show()

<code style="background:black;color:violet">The graph shows that sales are falling, which means a new, more advanced platform should appear soon.
But it should also be taken into account that the data for 2016 is incomplete.</code>

In [None]:
# Rebuild "table_actual" so that platform names become columns

table_actual_for_boxplot = table_actual.pivot_table(index=['year_of_release'], columns=['platform'], values='sum_sales', aggfunc='first')
columns_for_boxplot = table_actual_for_boxplot.columns.tolist()

# Plot a box and whisker plot of global game sales by platform

table_actual_for_boxplot.boxplot(column=columns_for_boxplot, figsize=(15,15), grid=True)

# The plot is drawn, but a type error is reported, probably because "columns_for_boxplot" is passed as a list.
# Converting it to an array did not help: in that case the plot is not drawn at all

<code style="background:black;color:violet">The spread of sales across platforms is quite large.
Choosing the right platform does not guarantee sales success, but it can help.</code>
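The spread can also be inspected numerically. The sketch below is one possible way to do it, on made-up per-game sales rather than the yearly totals used in the box plot: `describe()` returns the quartiles for each platform, and the interquartile range is derived from them.

```python
import pandas as pd

# Toy per-game sales, invented for illustration.
toy = pd.DataFrame({
    'platform': ['PS4'] * 4 + ['PC'] * 4,
    'sum_sales': [0.1, 0.5, 1.0, 8.0, 0.05, 0.1, 0.2, 0.3],
})

# Count, mean, quartiles and extremes of sales for every platform.
spread = toy.groupby('platform')['sum_sales'].describe()

# Interquartile range: the height of the box in a box-and-whisker plot.
spread['iqr'] = spread['75%'] - spread['25%']

print(spread[['count', '25%', '50%', '75%', 'max', 'iqr']])
```

A large gap between the median (`50%`) and `max` points to a few best-sellers dominating a platform's sales.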

## Dependence of global sales on the evaluation of critics and users

In [None]:
# Get a table for the current period. Take "PS4" as the platform to examine,
# because it is the most popular one in 2016

df_actual_PS4 = df.query('year_of_release >= 2007 and platform == "PS4"')

In [None]:
# Let's see how user reviews and critics affect sales

df_actual_PS4.plot(x='critic_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 100), title='critic_score')
corr_critic = df_actual_PS4['critic_score'].corr(df_actual_PS4['sum_sales'])
plt.show()
print('Correlation between critic score and global sales:', corr_critic)
print()


df_actual_PS4.plot(x='user_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 10), title='user_score') 
corr_user = df_actual_PS4['user_score'].corr(df_actual_PS4['sum_sales'])
plt.show()
print('Correlation between user rating and global sales:', corr_user)

In [None]:
# Check output also for "PC" and "Xbox One" platforms

# "PC"

df_actual_PC = df.query('year_of_release >= 2007 and platform == "PC"')

df_actual_PC.plot(x='critic_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 100), title='critic_score')
corr_critic = df_actual_PC['critic_score'].corr(df_actual_PC['sum_sales'])
plt.show()
print('Correlation between critic score and global sales:', corr_critic)
print()


df_actual_PC.plot(x='user_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 10), title='user_score') 
corr_user = df_actual_PC['user_score'].corr(df_actual_PC['sum_sales'])
plt.show()
print('Correlation between user rating and global sales:', corr_user)

In [None]:
# "Xbox One"

df_actual_XOne = df.query('year_of_release >= 2007 and platform == "XOne"')

df_actual_XOne.plot(x='critic_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 100), title='critic_score')
corr_critic = df_actual_XOne['critic_score'].corr(df_actual_XOne['sum_sales'])
plt.show()
print('Correlation between critic score and global sales:', corr_critic)
print()


df_actual_XOne.plot(x='user_score', y='sum_sales', kind='scatter', figsize=(5,5), grid=True, xlim = (0, 10), title='user_score') 
corr_user = df_actual_XOne['user_score'].corr(df_actual_XOne['sum_sales'])
plt.show()
print('Correlation between user rating and global sales:', corr_user)
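The three cells above repeat the same steps for different platforms. One way to avoid the repetition is a small helper; the function name `score_sales_corr` and the toy data are hypothetical, and the sketch covers only the correlations, not the scatter plots.

```python
import pandas as pd

def score_sales_corr(data, platform, min_year=2007):
    """Pearson correlation of critic and user scores with total sales for one platform."""
    subset = data[(data['platform'] == platform) & (data['year_of_release'] >= min_year)]
    return {
        'critic_score': subset['critic_score'].corr(subset['sum_sales']),
        'user_score': subset['user_score'].corr(subset['sum_sales']),
    }

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'platform': ['PS4'] * 4 + ['PC'] * 4,
    'year_of_release': [2014, 2015, 2015, 2016, 2008, 2010, 2012, 2014],
    'critic_score': [60, 70, 80, 90, 60, 70, 80, 90],
    'user_score': [8.0, 6.0, 7.0, 5.0, 5.0, 6.0, 7.0, 8.0],
    'sum_sales': [1.0, 2.0, 3.0, 4.0, 0.4, 0.3, 0.2, 0.1],
})

for name in ['PS4', 'PC']:
    print(name, score_sales_corr(toy, name))
```

With the real `df`, the loop could run over `['PS4', 'PC', 'XOne']` to compare all three platforms side by side.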

<code style="background:black;color:violet">Critic scores give an indirect indication of what global sales might be.
There is a correlation, but it is not very strong.</code>

## Distribution of profits by genre

In [None]:
# Get a table with the current period

df_actual = df.query('year_of_release >= 2007')

# Profitability of games by genre

table_actual_genres = df_actual.pivot_table(index='genre', values='sum_sales', aggfunc={'sum', 'count'})
table_actual_genres['ratio_sum/count'] = table_actual_genres['sum'] / table_actual_genres['count']
display(table_actual_genres.sort_values(by='ratio_sum/count', ascending=False))
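The `sum / count` ratio above is the mean sales per game, which a single best-seller can inflate. As a cross-check, the median can be computed next to it. A minimal sketch on made-up data, using `groupby(...).agg(...)` as an alternative to the pivot table:

```python
import pandas as pd

# Toy per-game sales, invented for illustration.
toy = pd.DataFrame({
    'genre': ['Shooter', 'Shooter', 'Shooter', 'Adventure', 'Adventure'],
    'sum_sales': [10.0, 0.5, 0.3, 0.2, 0.4],
})

# Total, number of games, mean (= sum / count) and median sales per genre.
genre_stats = (
    toy.groupby('genre')['sum_sales']
       .agg(['sum', 'count', 'mean', 'median'])
       .sort_values(by='median', ascending=False)
)

print(genre_stats)
```

In this toy example the Shooter mean is 3.6 while its median is only 0.5, which shows how far the two measures can diverge.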

<code style="background:black;color:violet">By average sales per game, the top genre is **Shooter**, followed by **Platform** and **Sports**.
The genre with the lowest average sales is **Adventure**. </code>

# User portrait of each region (NA, EU, JP)

## NA

### Most popular platforms (top 5)

In [None]:
df_actual_p_t_platform_na = df_actual.pivot_table(index='platform', values='na_sales', aggfunc={'sum'})
df_actual_p_t_platform_na['part'] = df_actual_p_t_platform_na['sum'] / df_actual_p_t_platform_na['sum'].sum()
print(df_actual_p_t_platform_na.sort_values(by='sum', ascending=False).head(5))

### Most popular genres (top 5)

In [None]:
df_actual_p_t_genre_na = df_actual.pivot_table(index='genre', values='na_sales', aggfunc={'sum'})
df_actual_p_t_genre_na['part'] = df_actual_p_t_genre_na['sum'] / df_actual_p_t_genre_na['sum'].sum()
print(df_actual_p_t_genre_na.sort_values(by='sum', ascending=False).head(5))

### Impact of ESRB rating on sales

In [None]:
df_actual.pivot_table(index='rating', values='na_sales', aggfunc={'sum'}).sort_values(by='sum', ascending=False)

<code style="background:black;color:violet">The most commercially successful games in North America are on the **Xbox 360** platform, in the **Action** or **Shooter** genre, rated **E** or **Mature 17+**.</code>

## EU

### Most popular platforms (top 5)

In [None]:
df_actual_p_t_platform_eu = df_actual.pivot_table(index='platform', values='eu_sales', aggfunc={'sum'})
df_actual_p_t_platform_eu['part'] = df_actual_p_t_platform_eu['sum'] / df_actual_p_t_platform_eu['sum'].sum()
print(df_actual_p_t_platform_eu.sort_values(by='sum', ascending=False).head(5))

### Most popular genres (top 5)

In [None]:
df_actual_p_t_genre_eu = df_actual.pivot_table(index='genre', values='eu_sales', aggfunc={'sum'})
df_actual_p_t_genre_eu['part'] = df_actual_p_t_genre_eu['sum'] / df_actual_p_t_genre_eu['sum'].sum()
print(df_actual_p_t_genre_eu.sort_values(by='sum', ascending=False).head(5))

### Impact of ESRB rating on sales

In [None]:
df_actual.pivot_table(index='rating', values='eu_sales', aggfunc={'sum'}).sort_values(by='sum', ascending=False)

<code style="background:black;color:violet">The most commercially successful games in Europe are on the **PlayStation 3** platform, in the **Action** genre, rated **E**.</code>

## JP

### Most popular platforms (top 5)

In [None]:
df_actual_p_t_platform_jp = df_actual.pivot_table(index='platform', values='jp_sales', aggfunc={'sum'})
df_actual_p_t_platform_jp['part'] = df_actual_p_t_platform_jp['sum'] / df_actual_p_t_platform_jp['sum'].sum()
print(df_actual_p_t_platform_jp.sort_values(by='sum', ascending=False).head(5))

### Most popular genres (top 5)

In [None]:
df_actual_p_t_genre_jp = df_actual.pivot_table(index='genre', values='jp_sales', aggfunc={'sum'})
df_actual_p_t_genre_jp['part'] = df_actual_p_t_genre_jp['sum'] / df_actual_p_t_genre_jp['sum'].sum()
print(df_actual_p_t_genre_jp.sort_values(by='sum', ascending=False).head(5))

### Impact of ESRB rating on sales

In [None]:
df_actual.pivot_table(index='rating', values='jp_sales', aggfunc={'sum'}).sort_values(by='sum', ascending=False)

<code style="background:black;color:violet">The most commercially successful games in Japan are on the **Nintendo DS** platform, in the **Role-Playing** genre, rated **E**.</code>
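The NA, EU and JP cells differ only in the sales column and the grouping column, so the regional portraits could be produced by one helper. The function name `top_share` and the toy data are hypothetical; this is a sketch of the pattern, not the notebook's code.

```python
import pandas as pd

def top_share(data, by, sales_col, n=5):
    """Total sales and share of the regional total for the top-n groups."""
    totals = data.groupby(by)[sales_col].sum()
    result = pd.DataFrame({'sum': totals, 'part': totals / totals.sum()})
    return result.sort_values(by='sum', ascending=False).head(n)

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'platform': ['X360', 'X360', 'PS3', 'Wii'],
    'genre': ['Action', 'Shooter', 'Action', 'Sports'],
    'na_sales': [3.0, 2.0, 4.0, 1.0],
    'jp_sales': [0.1, 0.1, 0.5, 0.3],
})

for region in ['na_sales', 'jp_sales']:
    for by in ['platform', 'genre']:
        print(region, 'by', by)
        print(top_share(toy, by, region), end='\n\n')
```

Passing `'rating'` as `by` would give the ESRB breakdown for the same region.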

# Testing hypotheses

## Average user ratings for Xbox One and PC platforms are the same

### Null hypothesis

Average user ratings for `Xbox One` and `PC` platforms are the same

### Alternative

Average user ratings for `Xbox One` and `PC` platforms are NOT the same

### Calculation

In [None]:
alpha = 0.01 # Since the sample is quite large, we choose a smaller level of statistical significance

df_actual_user_score_xbox_one = df_actual.query('platform == "XOne"')['user_score']
df_actual_user_score_pc = df_actual.query('platform == "PC"')['user_score']

results = st.ttest_ind(
    df_actual_user_score_xbox_one, 
    df_actual_user_score_pc)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

<code style="background:black;color:violet">The null hypothesis was not rejected: at the chosen significance level, no significant difference between the average user ratings of the two platforms was found.</code>
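By default `st.ttest_ind` assumes that both samples have equal variances. When the samples differ in size and their variances may differ too, Welch's variant (`equal_var=False`) is a common, more cautious choice. The sketch below uses randomly generated samples, not the notebook's data, and only illustrates the call.

```python
import numpy as np
from scipy import stats as st

# Synthetic samples, generated for illustration: equal means, different spread and size.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=7.0, scale=1.0, size=200)
sample_b = rng.normal(loc=7.0, scale=2.0, size=500)

alpha = 0.01

# Welch's t-test: does not assume equal variances of the two samples.
results = st.ttest_ind(sample_a, sample_b, equal_var=False)

print('Sample variances:', np.var(sample_a, ddof=1), np.var(sample_b, ddof=1))
print('p-value:', results.pvalue)

if results.pvalue < alpha:
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")
```

Comparing the two sample variances first, as printed above, is a quick way to decide whether the default equal-variance test is reasonable.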

## Average user ratings for Action and Sports are different

### Null hypothesis

Average user ratings for `Action` and `Sports` genres are the same

### Alternative

Average user ratings for `Action` and `Sports` genres are NOT the same

### Calculation

In [None]:
alpha = 0.01 # Since the sample is quite large, we choose a smaller level of statistical significance

df_actual_user_score_action = df_actual.query('genre == "Action"')['user_score']
df_actual_user_score_sports = df_actual.query('genre == "Sports"')['user_score']

results = st.ttest_ind(
    df_actual_user_score_action, 
    df_actual_user_score_sports)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

<code style="background:black;color:violet">Average user rating for the **Action** genre is not equal to the average user rating for the **Sports** genre. </code>

# Conclusion

To plan an advertising campaign, first find out which sales region is involved. <br />
Next, determine the genre of the game, its platform and its age rating. <br />
After that, check the critics' scores, since they are correlated with sales, although not strongly. <br />

For example, during the study, the following was found: <br />
<br />
In North America, a game in the `Action` or `Shooter` genre on the `Xbox 360` platform with an `E` or `Mature 17+` rating is the most likely to be a commercial success.
<br />
In Europe, a game in the `Action` or `Shooter` genre on the `PlayStation 3` platform with an `E` or `Mature 17+` rating is the most likely to be a commercial success.
<br />
In Japan, a game in the `Role-Playing` or `Action` genre on the `Nintendo DS` platform with an `E` or `Mature 17+` rating is the most likely to be a commercial success.
<br />
<br />
For worldwide success, the most promising choice is a `Shooter` game on the `PlayStation 4` platform.