# Game market research

An online game store sells computer games all over the world. Historical game sales data, user and expert ratings, genres, and platforms (such as Xbox or PlayStation) are available from public sources. The goal is to identify the patterns that determine whether a game succeeds. This will make it possible to bet on potentially popular products and plan advertising campaigns.
The data run up to 2016. It is now December 2016, and we are planning a campaign for 2017. The task is to work out an approach to handling the data.

Data Description games.csv

- Name - the name of the game
- Platform - platform
- Year_of_Release - year of release
- Genre - game genre
- NA_sales - Sales in North America (millions of dollars)
- EU_sales - sales in Europe (millions of dollars)
- JP_sales - sales in Japan (millions of dollars)
- Other_sales - sales in other countries (millions of dollars)
- Critic_Score - Critics score (from 0 to 100)
- User_Score - user score (from 0 to 10)
- Rating - rating from the ESRB (Entertainment Software Rating Board). This association determines the rating of computer games and assigns them an appropriate age category.

Data for 2016 may not be complete.

# Step 1. Open the data file and study general information
Import the necessary libraries and read the file

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as st

In [2]:
df = pd.read_csv('/datasets/games.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/games.csv'

Exploring general information about data

In [None]:
df.info()

The table has 16713 entries. Not all columns are complete, and there are many missing values.

Let's look at the first 10 rows

In [None]:
df.head(10)

Most missing values are concentrated in the score columns. Given the size of the dataset, this may be caused by:
1) Game size (smaller games often go unrated)
2) Age (older games weren't rated in the data source)
3) Failures when loading the data, or other issues
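
To judge how serious these gaps are, it helps to look at the share of missing values per column. The following is a minimal sketch on a made-up toy frame (the values are assumptions for illustration, not rows from games.csv). Note that a text marker such as 'tbd' is not counted as missing until it is converted.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for games.csv (values are invented for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "critic_score": [80.0, np.nan, np.nan, 65.0],
    "user_score": ["7.5", "tbd", np.nan, "8.1"],
})

# Share of missing values per column, in percent.
# 'tbd' is an ordinary string here, so it is NOT counted as missing.
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```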

In [None]:
df.hist(figsize=(15, 20));

We see outliers in every column except the release year and the critic score
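
One way to quantify such outliers is to measure the share of rows above a cutoff before deciding whether to drop them. The sketch below uses a hypothetical helper, `drop_above`, and a cutoff of 3 chosen purely for illustration. The toy values are invented.

```python
import pandas as pd

def drop_above(frame, column, threshold):
    """Return (share of rows above threshold in percent, frame without those rows)."""
    mask = frame[column] > threshold
    share = 100 * mask.sum() / frame[column].count()
    return share, frame[~mask]

# Invented sales figures: one value is far above the rest
toy = pd.DataFrame({"na_sales": [0.1, 0.5, 4.2, 1.0, 0.3]})
share, cleaned = drop_above(toy, "na_sales", 3)
print(f"{share:.1f}% of rows exceed the cutoff; {len(cleaned)} rows remain")
```

A helper like this avoids repeating the same count, share, and drop steps for every sales column.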

# Step 2. Prepare the data
Convert the column names to lower case and check the result

In [None]:
df.columns = df.columns.str.lower()
print(df.columns)

In [None]:
# Calculate the share of outlying values (above 3) and drop them if the share is small
print(df[df['na_sales']>3]['na_sales'].count())
print(100*df[df['na_sales']>3]['na_sales'].count()/df['na_sales'].count())
print(df['na_sales'].hist(bins=10,range=(0,3)))
df = df.drop(index = df[(df['na_sales'] >3)].index)
#validation
print(df[df['na_sales']>3]['na_sales'].count())


In [None]:
# Calculate the share of outlying values (above 3) and drop them if the share is small
print(df[df['eu_sales']>3]['eu_sales'].count())
print(100*df[df['eu_sales']>3]['eu_sales'].count()/df['eu_sales'].count())
print(df['eu_sales'].hist(bins=10,range=(0,3)))
df = df.drop(index = df[(df['eu_sales'] >3)].index)
#validation
print(df[df['eu_sales']>3]['eu_sales'].count())


In [None]:
# Calculate the share of outlying values (above 3) and drop them if the share is small
print(df[df['other_sales']>3]['other_sales'].count())
print(100*df[df['other_sales']>3]['other_sales'].count()/df['other_sales'].count())
print(df['other_sales'].hist(bins=10,range=(0,3)))
df = df.drop(index = df[(df['other_sales'] >3)].index)
#validation
print(df[df['other_sales']>3]['other_sales'].count())


In [None]:
# Calculate the share of outlying values (above 3) and drop them if the share is small
print(df[df['jp_sales']>3]['jp_sales'].count())
print(100*df[df['jp_sales']>3]['jp_sales'].count()/df['jp_sales'].count())
print(df['jp_sales'].hist(bins=10,range=(0,3)))
df = df.drop(index = df[(df['jp_sales'] >3)].index)
#validation
print(df[df['jp_sales']>3]['jp_sales'].count())


In [None]:
df.hist(figsize=(15, 20));

Let's count the missing values

In [None]:
df.isna().sum()

Most missing values are in the score columns. A few remain in the release year and the game name.

Drop the rows with a missing game name and check

In [None]:
df.dropna(subset=['name'], inplace=True)
df['name'].isna().sum()

Convert the data to the required types, noting which columns were changed and why.
First, check the current data types

In [None]:
df.dtypes

Let's convert the numerical data (scores, years) to a numeric format and leave the rest as text.
Values of 'tbd' (to be determined, that is, a score not yet assigned) can be cast to NaN
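
As a minimal sketch of that conversion, `pd.to_numeric` with `errors='coerce'` turns any non-numeric text, including 'tbd', into NaN in one step. The series below is invented for illustration.

```python
import pandas as pd

# Invented user scores stored as text, with a 'tbd' marker and a real gap
scores = pd.Series(["8.3", "tbd", None, "6.0"])

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric)
print("dtype:", numeric.dtype)
```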

In [None]:
df['user_score'] = df['user_score'].replace('tbd', '', regex=True)

Check the rows with a missing release year

In [None]:
df[df['year_of_release'].isna()]

In [None]:
df.query('name == "FIFA Soccer 2004"')

We see that the gaps can be filled using the same game's release year on other platforms. Let's try it and check the result
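
The same idea can also be expressed without an explicit loop. The sketch below is an assumption about one possible vectorized form, shown on an invented toy frame: it takes the latest known year per game title and uses it only where the year is missing.

```python
import numpy as np
import pandas as pd

# Invented rows: one title released on several platforms, one title with no known year
toy = pd.DataFrame({
    "name": ["Game A", "Game A", "Game A", "Game B"],
    "platform": ["PS2", "XB", "GC", "PC"],
    "year_of_release": [2003.0, np.nan, 2003.0, np.nan],
})

# Latest known year for each title, aligned to the original rows
known_year = toy.groupby("name")["year_of_release"].transform("max")

# Fill only the gaps; titles with no known year on any platform stay NaN
toy["year_of_release"] = toy["year_of_release"].fillna(known_year)
print(toy)
```

Titles that have no year on any platform remain unfilled, so a second check of the missing count is still worthwhile.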

In [None]:
print(df['year_of_release'].isna().sum())
# For each row without a year, take the latest known year of the same game on other platforms
for i in df[df['year_of_release'].isna()].index:
    df.loc[i, 'year_of_release'] = df.loc[df['name'] == df.loc[i, 'name'], 'year_of_release'].max()
print(df['year_of_release'].isna().sum())

Convert the numerical columns

In [None]:
df['user_score']=pd.to_numeric(df['user_score'], errors='coerce')

In [None]:
# Nullable integer type keeps the remaining missing years as <NA>
df['year_of_release'] = df['year_of_release'].astype('Int64')
# Scores stay as floats so that missing values remain NaN
df['critic_score'] = df['critic_score'].astype('float')
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')

In [None]:
df.dtypes

Calculate the total sales in all regions and write them in a separate column.

In [None]:
df['total_sales'] = (df['na_sales']+df['eu_sales']+df['jp_sales']+df['other_sales'])
print('Total sales in all regions:',df['total_sales'].sum())

# Step 3. Exploratory data analysis
See how many games were released over the years. Is data for all periods important?

We group the data by year and plot it.
Data before 2000 look unrepresentative, since they differ greatly from the more recent period

In [None]:
(
df
.pivot_table(index='year_of_release', values='name', aggfunc='count')
.plot(y='name', kind='bar', figsize=(16, 8), title='Number of new games for each year', grid=True, color='red', alpha=0.6, legend=False)
.set(xlabel='Release year', ylabel='Number of new games')
)
plt.show()

We see rapid growth starting in 1995. The later drop may be caused by incomplete data on the games market, since mobile platforms (iPhone and its analogues) are not taken into account. The first iPhone appeared in 2007, while a mobile games market already existed on Nokia, Sony, and Samsung devices

See how sales have changed across platforms. Select the platforms with the highest total sales and plot the distribution by year.

In [None]:
df.groupby('platform')['total_sales'].sum().sort_values(ascending=False).plot(kind='bar', figsize=(13,4))
plt.show()

We build charts for the top platforms

In [None]:
top_platforms  = list(df.groupby('platform')['total_sales'].sum().sort_values(ascending=False)[0:8].keys())
for i in top_platforms:
    df[df['platform'] == i].pivot_table(index='year_of_release', values='total_sales', aggfunc='sum').plot(
    kind='bar', figsize=(13,4))
    plt.title(i)
    plt.xlabel("Years")
    plt.ylabel("Sales")

The charts show that a platform lives for about 10 years and that a new platform appears around the middle of that cycle, so we take the 5 years starting from 2011.
2016 is excluded because its data are incomplete
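
The lifespan estimate can also be checked numerically rather than read off the charts. The following is a rough sketch on invented rows: it takes the first and last release year seen for each platform and looks at the difference.

```python
import pandas as pd

# Invented rows: each row is one game release on a platform
toy = pd.DataFrame({
    "platform": ["PS2", "PS2", "PS2", "PS3", "PS3"],
    "year_of_release": [2000, 2005, 2011, 2006, 2016],
})

# First and last year with releases for each platform
span = toy.groupby("platform")["year_of_release"].agg(["min", "max"])
span["lifespan"] = span["max"] - span["min"]
print(span)
print("Median lifespan:", span["lifespan"].median())
```

On real data, platforms that are still active would understate the lifespan, so they may need to be excluded from the median.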

In [None]:
# Keep 2011-2015: five years starting from 2011, with the incomplete 2016 excluded
df_new = df[(df['year_of_release'] >= 2011) & (df['year_of_release'] < 2016)]
top_platforms  = list(df_new.groupby('platform')['total_sales'].sum().sort_values(ascending=False)[0:8].keys())

Look at the overall distribution of games by genre. What about the most profitable genres? Do genres with high and low sales stand out? Which platforms are leading in sales, rising or falling? Pick a few potentially profitable platforms.

In [None]:
(
    df_new
    .pivot_table(index='platform', values='total_sales', aggfunc='sum')
    .sort_values('total_sales', ascending = False)
    .plot(y='total_sales', kind='bar', figsize=(16, 8), title='Total sales by platform', grid=False, color='red', alpha=0.5, legend=False)
    .set(xlabel='Platform', ylabel='Sales volume')
)
plt.show()

top_3_platforms = (
    df_new
    .pivot_table(index='platform', values='total_sales', aggfunc='sum')
    .sort_values('total_sales', ascending = False)
    .head(3).index.tolist()
)
print('Top 3 platforms:',top_3_platforms)

Keep only the top 3 platforms in the analysis

In [None]:
df_new = df_new.query('platform in @top_3_platforms')

Plot a box-and-whisker plot of each game's global sales, broken down by platform. Is there a big difference in sales? What about average sales across platforms? Describe the result.

In [None]:
df_new.groupby('platform')['total_sales'].describe()

In [None]:
table = df_new.groupby('platform')['total_sales'].describe()
plt.rcParams['figure.figsize']=(16, 5)
ax = sns.boxplot(x="platform", y="total_sales", data=df_new, palette='rainbow_r')
ax.set_ylim(0, 1)
plt.show()

We see that the medians for X360 and PS4 are the same, but PS4 has a wider spread in sales. For PS3, both the spread of sales and the median are smaller.

Sales on XOne and PS4 can be explained by their stage of maturity: they are in their prime while their predecessors are finishing their cycle. This also affects the sample size.
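
The spread that a box plot shows can be reproduced by hand. The sketch below uses invented sales figures and the common 1.5 × IQR rule for the upper fence. That rule is an assumption about the plot settings, which is the usual default.

```python
import pandas as pd

# Invented per-game sales for one platform
sales = pd.Series([0.1, 0.2, 0.3, 0.4, 2.0])

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1                      # height of the box
upper_fence = q3 + 1.5 * iqr       # points above this are drawn as outliers
outliers = sales[sales > upper_fence]

print("median:", sales.median())
print("IQR:", round(iqr, 2), "upper fence:", round(upper_fence, 2))
print("outliers:", outliers.tolist())
```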

See how sales within one popular platform are impacted by user and critic reviews. Plot a scatterplot and calculate the correlation between reviews and sales. Formulate conclusions and correlate them with sales of games on other platforms.

In [None]:
print(top_platforms)

In [None]:
#Let's write a function that will draw scatter plots and calculate correlations

def corr_func(name_of_platform):
    platform = df_new[df_new['platform']==name_of_platform]
    fig, ax = plt.subplots(1 ,2, figsize=(15,5))
    sns.scatterplot(x='user_score', y='total_sales', data=platform, ax=ax[0])
    sns.scatterplot(x='critic_score', y='total_sales', data=platform, ax=ax[1])
    fig.suptitle(name_of_platform, fontsize=15)
    ax[0].set(xlabel='User score')
    ax[1].set(xlabel='Critic score')
    ax[0].set(ylabel='Total sales')
    ax[1].set(ylabel='Total sales')
    plt.show()
    
    correl = platform['user_score'].corr(platform['total_sales'])
    critic_correl = platform['critic_score'].corr(platform['total_sales'])
    print('Correlation between critic reviews and sales ', name_of_platform.upper(), round(critic_correl,1))
    print('Correlation between user reviews and sales ', name_of_platform.upper(), round(correl,1))

    print('\n')

In [None]:
# Using a loop, display all 6 charts (2 per platform); df_new now holds only the top 3 platforms
for platform in top_3_platforms:
    corr_func(platform)

We see a low correlation between ratings and sales. Critic scores correlate with sales more strongly than user scores do, which suggests that buyers trust critics more, with the exception of Asian platforms

Look at the overall distribution of games by genre. What about the most profitable genres? Do genres with high and low sales stand out?

In [None]:
sales_d = df_new.pivot_table(index='genre', values='total_sales', aggfunc='sum').sort_values(
    by='total_sales', ascending=False).reset_index().rename_axis(None, axis=1)
sales_d

In [None]:
sales_d = df_new.pivot_table(index='genre', values='total_sales', aggfunc='median').sort_values(
    by='total_sales', ascending=False).reset_index().rename_axis(None, axis=1)
sales_d

The most profitable genres are Action, Shooter, and Sports. The least successful are Strategy and Puzzle. Buyers appear to be attracted by fast-paced games.
Profit margins cannot be inferred because there is no data on production costs. The Platform genre has a high median, which points to a wide dispersion of sales within the Action genre and to the prospects of the Platform genre
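
The difference between total and median sales is easier to see side by side. Here is a small sketch on invented numbers: one genre has many weak titles plus a single hit, the other has few but consistently solid titles.

```python
import pandas as pd

# Invented per-game sales: Action has one big hit, Platform is consistently solid
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "Action", "Platform", "Platform"],
    "total_sales": [0.1, 0.1, 0.2, 3.0, 0.8, 1.0],
})

# Count, total, and median per genre in one table
summary = toy.groupby("genre")["total_sales"].agg(["count", "sum", "median"])
print(summary)
```

In this toy case Action leads by total sales while Platform leads by median, which is the pattern described above.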

# Step 4. Create a user profile for each region
Define for the user of each region (NA, EU, JP):
The most popular platforms (top 5). Describe the differences in sales shares.
The most popular genres (top 5). Explain the difference.
Does the ESRB rating affect sales in a particular region?

NA

In [None]:
df2=df_new.groupby('platform')['na_sales'].sum().sort_values(ascending=False)
a = df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()

print('Top 5:',a.head(5).index.tolist())

In [None]:
df2=df_new.groupby('genre')['na_sales'].sum().sort_values(ascending=False)
a=df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In [None]:
b=df_new.groupby('rating')['na_sales'].sum().sort_values(ascending=False)
b.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',b.head(5).index.tolist())

EU

In [None]:
df2=df_new.groupby('platform')['eu_sales'].sum().sort_values(ascending=False)
a=df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In [None]:
df2=df_new.groupby('genre')['eu_sales'].sum().sort_values(ascending=False)
a=df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In [None]:
a=df_new.groupby('rating')['eu_sales'].sum().sort_values(ascending=False)
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

JP

In [None]:
df2=df_new.groupby('platform')['jp_sales'].sum().sort_values(ascending=False)
a=df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In [None]:
df2=df_new.groupby('genre')['jp_sales'].sum().sort_values(ascending=False)
a=df2*100/df2.sum()
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In [None]:
a=df_new.groupby('rating')['jp_sales'].sum().sort_values(ascending=False)
a.plot(kind='bar', figsize=(13,4))
plt.show()
print('Top 5:',a.head(5).index.tolist())

In general, Europe and North America are similar, while Japan is different, which is easy to explain by cultural differences.
In all regions, E- and M-rated games collect large sales, but the leader in Japan is different. At the same time, games without a rating account for the lion's share. This can be explained by a lack of ESRB coverage, especially for the Japanese market. An additional factor could be a long tail of unrated games or an unwillingness to have a game rated.
Japan prefers its own consoles and has distinct genre preferences
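
The per-region cells above repeat the same steps, so they could be condensed into one helper. The sketch below uses a hypothetical function, `top_share`, and invented sales figures. It returns the top categories by their percentage share of a region's sales.

```python
import pandas as pd

def top_share(frame, group_col, sales_col, n=5):
    """Percentage share of sales_col per group, largest first, top n only."""
    totals = frame.groupby(group_col)[sales_col].sum()
    share = (100 * totals / totals.sum()).sort_values(ascending=False)
    return share.head(n)

# Invented regional sales per game
toy = pd.DataFrame({
    "platform": ["PS4", "PS4", "X360", "PS3"],
    "na_sales": [1.0, 1.0, 1.5, 0.5],
    "jp_sales": [0.2, 0.0, 0.0, 0.8],
})

for region in ["na_sales", "jp_sales"]:
    print(region, top_share(toy, "platform", region, n=2).round(1).to_dict())
```

The same call works for genres or ratings by changing `group_col`.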

# Step 5. Test the hypotheses
The average user ratings of the Xbox One and PC platforms are the same;
The average user ratings for Action and Sports genres are different.
Set the alpha threshold yourself.
Explain:
How did you formulate the null and alternative hypotheses;
What criterion was used to test the hypotheses and why.

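Null hypothesis: the population means are equal.
Alternative hypothesis: the population means are not equal.
We use Student's t-test for two independent samples (with equal_var=False, that is, without assuming equal variances), since we are working with samples rather than the whole population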
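
To show how the decision rule behaves, here is a minimal sketch on invented rating samples. The helper name `welch_decision` is hypothetical. It wraps `scipy.stats.ttest_ind` with `equal_var=False` and compares the p-value to alpha.

```python
from scipy import stats as st

def welch_decision(a, b, alpha=0.05):
    """Return (p-value, True if the null hypothesis of equal means is rejected)."""
    result = st.ttest_ind(a, b, equal_var=False)
    return result.pvalue, result.pvalue < alpha

# Invented user ratings
close_a = [7.0, 6.5, 8.0, 7.5, 6.0]
close_b = [7.1, 6.4, 8.1, 7.4, 6.1]   # nearly the same mean as close_a
far_b = [3.0, 2.5, 4.0, 3.5, 2.0]     # clearly lower mean

for label, other in [("close", close_b), ("far", far_b)]:
    p_value, rejected = welch_decision(close_a, other)
    verdict = "reject H0" if rejected else "fail to reject H0"
    print(f"{label}: p-value = {p_value:.4f} -> {verdict}")
```

Failing to reject the null hypothesis does not prove that the means are equal. It only means the samples do not provide enough evidence of a difference.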

In [None]:
df_new=df[(df['year_of_release']>2012)&(df['year_of_release']<2016)]
alpha= 0.05
sample_1=df_new[df_new['platform']=='XOne']['user_score'].dropna()
sample_2=df_new[df_new['platform']=='PC']['user_score'].dropna()
results = st.ttest_ind(sample_1, sample_2, equal_var = False)
print('p-value: ', results.pvalue)
if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

At the 5% significance level, we cannot reject the hypothesis that the average user ratings of the Xbox One and PC platforms are the same

In [None]:
alpha= 0.05
sample_1=df_new[df_new['genre']=='Action']['user_score'].dropna()
sample_2=df_new[df_new['genre']=='Sports']['user_score'].dropna()
results = st.ttest_ind(sample_1, sample_2, equal_var = False)
print('p-value: ', results.pvalue)
if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

At the 5% significance level, we reject the null hypothesis: the average user ratings of the Action and Sports genres are NOT the same

Hypothesis: "The average user ratings of the Xbox One and PC platforms are the same." The null hypothesis could not be rejected.
Hypothesis: "The average user ratings of the Action and Sports genres are the same." We reject the null hypothesis.

# Step 6. General conclusion
data omitted