<h1>Introduction</h1>
This analysis examines the video game sales data from the spreadsheet provided for the project. It looks at sales by platform, release year, and genre, and at how critic and user scores relate to total sales.

These are the different libraries that are necessary to complete the analysis.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats as st
from math import factorial as ft 
import plotly.express as px

This reads the data into a DataFrame and displays the information and the first five rows of the data.

In [2]:
ice_data = pd.read_csv('/Users/leahdeyoung/Desktop/GitHub/ice-games-practicum/moved_games.csv', encoding = "utf-8")

display(ice_data.head())
ice_data.info()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


<h2>Data PreProcessing</h2>

This code checks for any fully duplicate rows. There appear to be none.

In [3]:
print(ice_data.duplicated().sum())

0


I replaced all the column names with the lowercase version of the name.

In [4]:
ice_data = ice_data.rename(columns={
    'Name': 'name', 
    'Platform': 'platform',
    'Year_of_Release': 'year_of_release',
    'Genre': 'genre',
    'NA_sales': 'na_sales',
    'EU_sales': 'eu_sales',
    'JP_sales': 'jp_sales',
    'Other_sales': 'other_sales',
    'User_Score': 'user_score',
    'Critic_Score': 'critic_score',
    'Rating': 'rating'
})
print(ice_data.columns)

Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
       'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
      dtype='object')
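
Because every new name is just the lowercase form of the old one, the same rename can be done without listing each column. This is a minimal sketch on a toy frame (the `sample` name is only for illustration), not the cell that was actually run:

```python
import pandas as pd

# Toy frame with a few of the original mixed-case column names.
sample = pd.DataFrame(columns=['Name', 'Year_of_Release', 'NA_sales'])

# Lowercase every column label in one step.
sample.columns = sample.columns.str.lower()

print(list(sample.columns))
```

This avoids having to update a mapping by hand if a column is added later.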


This code checks for duplicates in the name data by converting all the names to lowercase values and dropping any duplicates that also have a duplicate year of release and platform. This is because some games have new releases in different years and games could be released on different platforms.

In [5]:
ice_data.info()
print(ice_data['name'].value_counts())
print(ice_data['name'].unique())
ice_data['name_lowercase'] = ice_data['name'].str.lower()
print(ice_data['name_lowercase'].value_counts())
print(ice_data['name_lowercase'].unique())

ice_data = ice_data.drop_duplicates(subset=['name_lowercase', 'year_of_release', 'platform']).reset_index(drop=True)

ice_data.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
Need for Speed: Most Wanted                         12
Ratatouille                                          9
LEGO Marvel Super Heroes                             9
FIFA 14                                              9
Madden NFL 07        
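
Before dropping anything, it can help to look at the rows that would be treated as duplicates. The sketch below uses a few made-up rows (the `games` frame and its values are assumptions for illustration) and the same three-column key as the cell above:

```python
import pandas as pd

# Toy rows: the first two differ only in capitalization of the name.
games = pd.DataFrame({
    'name': ['Madden NFL 13', 'madden nfl 13', 'Madden NFL 13', 'FIFA 14'],
    'platform': ['PS3', 'PS3', 'X360', 'PS3'],
    'year_of_release': [2012.0, 2012.0, 2012.0, 2013.0],
})
games['name_lowercase'] = games['name'].str.lower()

key = ['name_lowercase', 'year_of_release', 'platform']

# keep=False marks every member of a duplicate group, so all copies are visible.
dupes = games[games.duplicated(subset=key, keep=False)]
print(dupes)

# Dropping keeps the first row of each group.
deduped = games.drop_duplicates(subset=key).reset_index(drop=True)
print(len(deduped))
```

The same title on a different platform (the X360 row here) is kept, which matches the reasoning above.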

This code checks for names that are missing, then grabs the rows where the name is missing to view for analysis.

It looks like quite a bit of information is missing about this row; however, there were in fact sales in Europe and North America for this game, and the platform is valid and used for other games. Therefore, I decided to keep the row and add the string "Unknown Name" to the name field.

In [6]:
print(ice_data['name'].isna().sum())
print(ice_data.query("name.isna()"))
print(ice_data['platform'].value_counts().head(25))
ice_data['name'] = ice_data['name'].fillna('Unknown Name')
print(ice_data['name'].isna().sum())


1
    name platform  year_of_release genre  na_sales  eu_sales  jp_sales  \
659  NaN      GEN           1993.0   NaN      1.78      0.53       0.0   

     other_sales  critic_score user_score rating name_lowercase  
659         0.08           NaN        NaN    NaN            NaN  
PS2     2161
DS      2151
PS3     1330
Wii     1320
X360    1262
PSP     1209
PS      1197
PC       974
XB       824
GBA      822
GC       556
3DS      520
PSV      430
PS4      392
N64      319
XOne     247
SNES     239
SAT      173
WiiU     147
2600     133
NES       98
GB        98
DC        52
GEN       28
NG        12
Name: platform, dtype: int64
0


I looked at sample rows that are missing the year of release and noticed that some have the year in the name field. For those rows, I used the year listed in the name to fill the year of release. For the other rows, I replaced the missing year of release with the median year of release. For all rows, I converted the column to datetime and then kept just the year, so the column ends up as an integer year.

In [7]:
print(ice_data['year_of_release'].isna().sum())

#fill year from name
ice_data['year_of_release'] = ice_data['year_of_release'].where(ice_data['name'] != 'PES 2009: Pro Evolution Soccer', 2009)
ice_data['name_year'] = ice_data.query("year_of_release.isna() and (name.str.contains('200') or name.str.contains('19'))")['name'].str[-4:]
ice_data['year_of_release'] = ice_data['year_of_release'].where((ice_data['year_of_release'].notna() & ice_data['name_year'].isna()), ice_data['name_year'])

#check work
print(ice_data['year_of_release'].isna().sum())

#fill missing fields by median
year_median = ice_data['year_of_release'].median()
year_median = round(year_median, 0)
ice_data['year_of_release'] = ice_data['year_of_release'].where((ice_data['year_of_release'].notna()) , year_median) 

#convert datatype
ice_data['year_of_release'] = ice_data['year_of_release'].astype(int)
ice_data['year_of_release'] = pd.to_datetime(ice_data['year_of_release'], format='%Y')
ice_data['year_of_release'] = ice_data['year_of_release'].dt.year

#check work
print(ice_data['year_of_release'].isna().sum())



269
254
0
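
Taking the last four characters of the name only works when the title ends in the year. A regular expression is one way to pull out a four-digit year wherever it appears. This is a sketch on made-up rows (the `games` frame is an assumption), and the year in a title is only a proxy, since some titles carry the year after their release year:

```python
import pandas as pd

# Toy rows: two titles contain a year, two do not.
games = pd.DataFrame({
    'name': ['Madden NFL 2004', 'Space Invaders', 'Area 51', 'FIFA Soccer 2004'],
    'year_of_release': [None, 1978.0, None, None],
})

# Capture a standalone four-digit number starting with 19 or 20.
extracted = games['name'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False)
year_in_name = pd.to_numeric(extracted)

# Only fill rows where the year of release is missing.
games['year_of_release'] = games['year_of_release'].fillna(year_in_name)

print(games)
```

Rows without a year in the title ("Area 51" here) stay missing and can still be filled with the median afterwards.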


This code checks for genres that are missing, then grabs the rows where the genre is missing to view for analysis.

It looks like this is the same row that was missing the name. Based on my previous analysis, I decided to keep the row and add the string "Unknown Genre" to the genre field.

In [8]:
print(ice_data['genre'].isna().sum())
print(ice_data.query("genre.isna()"))
ice_data['genre'] = ice_data['genre'].fillna('Unknown Genre')
print(ice_data['genre'].isna().sum())


1
             name platform  year_of_release genre  na_sales  eu_sales  \
659  Unknown Name      GEN             1993   NaN      1.78      0.53   

     jp_sales  other_sales  critic_score user_score rating name_lowercase  \
659       0.0         0.08           NaN        NaN    NaN            NaN   

    name_year  
659       NaN  
0


I looked at sample rows that are missing the critic score and I did not see any pattern. It is likely these are simply games that did not have a critic score yet. This code calculates the mean critic score by year of release and fills in the missing critic score data with the appropriate mean by year.

In [9]:
#view data and compare
print(ice_data['critic_score'].isna().sum())
print(ice_data.query("critic_score.isna()").sample(5))
print(ice_data.query("critic_score.isna() and user_score.notna()")['name'].count())

#calculate critic score by year and fill missing values
critic_score_mean = ice_data.groupby('year_of_release')['critic_score'].mean()
critic_score_mean = critic_score_mean.fillna(0)
critic_score_mean = critic_score_mean.reset_index().rename(columns={0: 'year_of_release', 'critic_score': 'mean_critic_score'})
ice_data = ice_data.merge(critic_score_mean, on='year_of_release', how='left')
ice_data['critic_score'] = ice_data['critic_score'].fillna(ice_data['mean_critic_score'])

#check work
print(ice_data['critic_score'].isna().sum())


8577
                                                    name platform  \
3332                   MonHun Nikki: Poka Poka Ailu Mura      PSP   
13543                  Kamen Rider: Battride War Genesis      PS3   
16016                                       Dream Dancer       DS   
4394   Transformers: Revenge of the Fallen (Wii & PS2...       DS   
7554                              All Star Pro-Wrestling      PS2   

       year_of_release         genre  na_sales  eu_sales  jp_sales  \
3332              2010  Role-Playing      0.00      0.00      0.60   
13543             2016        Action      0.00      0.00      0.04   
16016             2007          Misc      0.01      0.00      0.00   
4394              2009        Action      0.26      0.14      0.00   
7554              2000      Fighting      0.00      0.00      0.20   

       other_sales  critic_score user_score rating  \
3332          0.00           NaN        NaN    NaN   
13543         0.00           NaN        NaN    NaN 
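
The fill-by-year step can also be written without a merge and a helper column. The sketch below uses `groupby(...).transform('mean')` on made-up scores (the `games` frame is an assumption); it is an alternative to the cell above, not what was run:

```python
import pandas as pd

# Toy rows: one missing score in each year.
games = pd.DataFrame({
    'year_of_release': [2006, 2006, 2006, 2008, 2008],
    'critic_score': [80.0, 70.0, None, 90.0, None],
})

# transform returns one value per row, aligned with the original index.
yearly_mean = games.groupby('year_of_release')['critic_score'].transform('mean')

# Replace only the missing scores with their year's mean.
games['critic_score'] = games['critic_score'].fillna(yearly_mean)

print(games)
```

Years with no scores at all stay NaN with this approach, so they would still need a separate decision.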

I looked at sample rows that are missing the rating and noticed that any game that is missing a rating is also missing a user score, though I am not sure a conclusion can be drawn from that. The code then fills blank values with "Rating Unknown".

In [10]:
print(ice_data['rating'].isna().sum())
print(ice_data.query("rating.isna()").sample(5))
print(ice_data.query("rating.isna() and user_score.notna()")['name'].count())
ice_data['rating'] = ice_data['rating'].fillna('Rating Unknown')
print(ice_data['rating'].isna().sum())

6765
                                                    name platform  \
4320                                         God Eater 2      PSV   
10489                       Little Battlers eXperience W      3DS   
14996                         DEATH NOTE: L o Tsugu Mono       DS   
15132               Strike Witches: Shirogane no Tsubasa      PSP   
10741  Disney Sing It! High School Musical 3: Senior ...     X360   

       year_of_release         genre  na_sales  eu_sales  jp_sales  \
4320              2013  Role-Playing      0.00       0.0      0.45   
10489             2013  Role-Playing      0.00       0.0      0.10   
14996             2007     Adventure      0.00       0.0      0.02   
15132             2012      Strategy      0.00       0.0      0.02   
10741             2009          Misc      0.09       0.0      0.00   

       other_sales  critic_score user_score rating  \
4320          0.00     71.278388        NaN    NaN   
10489         0.00     71.278388        NaN    NaN 

Some of the rows have a user score of "tbd". I am guessing this means that users have not given the game a score yet. To handle this, I averaged the unique user score values (excluding "tbd") and replaced the "tbd" values with that average. Note that this is an average of the distinct score values, not of every game's score. The remaining code calculates the mean user score by year of release and fills in the missing (NaN) user score data with the appropriate mean by year.

In [11]:
print(ice_data['user_score'].isna().sum())
print(ice_data['user_score'].unique())
print(ice_data.query("user_score == 'tbd'"))

#calculate overall user_score mean to replace "tbd" value
user_score_dropna = ice_data['user_score'].dropna()
score_list = list(user_score_dropna.unique())
score_list.remove('tbd')
score_float_list = [float(score) for score in score_list]
score_mean = sum(score_float_list) / len(score_float_list)
ice_data['user_score'] = ice_data['user_score'].where((ice_data['user_score'] != 'tbd') , score_mean) 

#fill NaN values with user_score mean by year
ice_data['user_score'] = ice_data['user_score'].astype(float)
user_score_mean = ice_data.groupby('year_of_release')['user_score'].mean()
user_score_mean = user_score_mean.fillna(0)
user_score_mean = user_score_mean.reset_index().rename(columns={0: 'year_of_release', 'user_score': 'mean_user_score'})
ice_data = ice_data.merge(user_score_mean, on='year_of_release', how='left')
ice_data['user_score'] = ice_data['user_score'].fillna(ice_data['mean_user_score'])

#check work
print(ice_data['user_score'].isna().sum())

6700
['8' nan '8.3' '8.5' '6.6' '8.4' '8.6' '7.7' '6.3' '7.4' '8.2' '9' '7.9'
 '8.1' '8.7' '7.1' '3.4' '5.3' '4.8' '3.2' '8.9' '6.4' '7.8' '7.5' '2.6'
 '7.2' '9.2' '7' '7.3' '4.3' '7.6' '5.7' '5' '9.1' '6.5' 'tbd' '8.8' '6.9'
 '9.4' '6.8' '6.1' '6.7' '5.4' '4' '4.9' '4.5' '9.3' '6.2' '4.2' '6' '3.7'
 '4.1' '5.8' '5.6' '5.5' '4.4' '4.6' '5.9' '3.9' '3.1' '2.9' '5.2' '3.3'
 '4.7' '5.1' '3.5' '2.5' '1.9' '3' '2.7' '2.2' '2' '9.5' '2.1' '3.6' '2.8'
 '1.8' '3.8' '0' '1.6' '9.6' '2.4' '1.7' '1.1' '0.3' '1.5' '0.7' '1.2'
 '2.3' '0.5' '1.3' '0.2' '0.6' '1.4' '0.9' '1' '9.7']
                                           name platform  year_of_release  \
119                               Zumba Fitness      Wii             2010   
301              Namco Museum: 50th Anniversary      PS2             2005   
520                             Zumba Fitness 2      Wii             2011   
645                                uDraw Studio      Wii             2010   
657    Frogger's Adventures: Temple of th
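
A different way to handle "tbd" is to treat it as missing, so those rows get the same by-year fill as the NaN rows. This is a sketch of that alternative on a made-up series (the `scores` values are assumptions), not the approach used in the cell above:

```python
import pandas as pd

# Toy user scores stored as strings, with one "tbd" and one missing value.
scores = pd.Series(['8', '8.3', 'tbd', None, '6.6'])

# errors='coerce' turns anything that is not a number into NaN.
numeric_scores = pd.to_numeric(scores, errors='coerce')

print(numeric_scores)
print(numeric_scores.isna().sum())
```

After this step the column is already float, so no separate `astype(float)` call is needed.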

This code creates a total sales column by adding the four regional sales columns (North America, Europe, Japan, and other) together.

In [12]:
ice_data['total_sales'] = ice_data['na_sales'] + ice_data['eu_sales'] + ice_data['jp_sales'] + ice_data['other_sales']

This code drops all the extra columns created to clean the data.

In [13]:
ice_data.drop(['name_lowercase', 'name_year', 'mean_critic_score', 'mean_user_score'], axis=1, inplace=True)



This code checks the info on the DataFrame and displays ten random rows.

In [14]:
ice_data.info()
display(ice_data.sample(10))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16713 entries, 0 to 16712
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16713 non-null  object 
 2   year_of_release  16713 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16713 non-null  float64
 5   eu_sales         16713 non-null  float64
 6   jp_sales         16713 non-null  float64
 7   other_sales      16713 non-null  float64
 8   critic_score     16713 non-null  float64
 9   user_score       16713 non-null  float64
 10  rating           16713 non-null  object 
 11  total_sales      16713 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 1.7+ MB


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,total_sales
16580,Irotoridori no Sekai: World's End Re-Birth,PSV,2015,Action,0.0,0.0,0.01,0.0,72.871111,6.475821,Rating Unknown,0.01
4174,Dragon Age Origins: Awakening,X360,2010,Role-Playing,0.33,0.1,0.0,0.04,67.482,6.093128,Rating Unknown,0.47
15271,Quiz Present Variety Q-Sama!! DS: Pressure Stu...,DS,2011,Misc,0.0,0.0,0.02,0.0,68.692,6.129688,Rating Unknown,0.02
2963,LEGO Batman 2: DC Super Heroes,DS,2012,Action,0.39,0.24,0.0,0.06,72.953125,8.0,E10+,0.69
6677,NBA Showtime: NBA on NBC,N64,1999,Sports,0.23,0.02,0.0,0.0,75.769231,7.764507,Rating Unknown,0.25
4112,Bass Pro Shops: The Strike,Wii,2009,Sports,0.44,0.0,0.0,0.03,67.554531,7.6,E,0.47
3465,Gex: Enter the Gecko,PS,1998,Platform,0.32,0.22,0.0,0.04,81.821429,8.506452,Rating Unknown,0.58
1231,Donkey Kong Country: Tropical Freeze,WiiU,2014,Platform,0.7,0.55,0.16,0.12,83.0,8.9,E,1.53
14319,Best Friends Tonight,DS,2010,Misc,0.03,0.0,0.0,0.0,67.482,4.989474,E10+,0.03
11264,Shin Megami Tensei: Devil Summoner - Raidou Ku...,PS2,2006,Role-Playing,0.0,0.0,0.09,0.0,67.33871,6.833128,Rating Unknown,0.09


<h2>Data Analysis</h2>

Look at how many games were released in different years. Is the data for every period significant?

In [15]:
games_per_year = ice_data.groupby('year_of_release')['name'].count()

print(games_per_year)

year_of_release
1980       9
1981      46
1982      36
1983      17
1984      14
1985      14
1986      21
1987      16
1988      15
1989      17
1990      16
1991      41
1992      43
1993      61
1994     121
1995     219
1996     263
1997     289
1998     379
1999     338
2000     350
2001     482
2002     830
2003     779
2004     764
2005     941
2006    1008
2007    1452
2008    1426
2009    1430
2010    1255
2011    1136
2012     652
2013     544
2014     581
2015     606
2016     502
Name: name, dtype: int64


It looks like the data starts to be significant in 1994, when the count first goes over 100. It could be that data was not reported consistently before that time.
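
The cutoff year can be read from the counts directly. The sketch below copies a few of the counts printed above into a small series; the threshold of 100 games comes from the observation above and is a judgment call, not a statistical test:

```python
import pandas as pd

# A slice of the per-year counts printed above.
games_per_year = pd.Series({1991: 41, 1992: 43, 1993: 61, 1994: 121, 1995: 219})

threshold = 100  # assumed cutoff for "enough games to be meaningful"

# Keep the years above the threshold and take the earliest one.
significant_years = games_per_year[games_per_year > threshold]
first_significant_year = int(significant_years.index.min())

print(first_significant_year)
```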

Look at how sales varied from platform to platform. Choose the platforms with the greatest total sales and build a distribution based on data for each year. Find platforms that used to be popular but now have zero sales. How long does it generally take for new platforms to appear and old ones to fade?

In [16]:
#create bar chart of all total sales
grp = ice_data.groupby(['platform', 'year_of_release'])
sales_per_platform = grp['total_sales'].sum()
sales_per_platform = sales_per_platform.reset_index().rename(columns={0: 'platform', 1: 'year_of_release', 'total_sales': 'total_sales'})
sales_platform_bar = px.bar(sales_per_platform,
                            x='platform',
                            y='total_sales',
                            color='year_of_release',
                            labels={
                                    'platform': 'Platform',
                                    'total_sales': 'Total Sales (in USD millions)',
                                    'year_of_release': 'Year of Release'},                              
                            title='Total Sales (in USD millions) per Platform per Year')

#grab top platforms and create bar chart of their total sales only
sorted_by_total_sales = sales_per_platform.sort_values(by='total_sales', ascending=False)
top_platforms = sorted_by_total_sales['platform'].head(20).reset_index().rename(columns={0: 'platform'})
unique_platforms = top_platforms['platform'].unique()
unique_platforms_list = list(unique_platforms)
top_sales_platform = sales_per_platform[sales_per_platform['platform'].isin(unique_platforms_list)]
top_sales_platform_bar = px.bar(top_sales_platform,
                                x='platform',
                                y='total_sales',
                                color='year_of_release',
                                labels={
                                        'platform': 'Platforms with Greatest Total Sales',
                                        'total_sales': 'Total Sales (in USD millions)',
                                        'year_of_release': 'Year of Release'},                              
                                title='Total Annual Sales (in USD millions) for Platforms with Greatest Sales')

#platforms that were popular
grp2 = ice_data.groupby(['year_of_release', 'platform'])
formerly_popular = grp2['total_sales'].sum()
formerly_popular = formerly_popular.reset_index().rename(columns={0: 'year_of_release', 1: 'platform', 'total_sales': 'total_sales'})
formerly_popular_sorted = formerly_popular.sort_values(by='total_sales', ascending=False)
formerly_popular_sorted_scatter = px.scatter(formerly_popular_sorted,
                                     x='platform',
                                     y='year_of_release',
                                     color='total_sales',
                                     labels={
                                             'platform': 'Platform',
                                             'total_sales': 'Total Sales (in USD millions)',
                                             'year_of_release': 'Year of Release'},                              
                                     title="Platforms' Appearance and Disappearance in terms of Annual Total Sales (USD Millions)")

#display results
sales_platform_bar.show()
top_sales_platform_bar.show()
formerly_popular_sorted_scatter.show()
#print(top_sales_platform)
#print(formerly_popular_sorted)
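
To put a number on how long platforms last, the first and last year with sales can be computed per platform. This is a sketch on made-up figures (the `sales` frame and its values are assumptions). On the real data, rows whose year was filled with the median could add a spurious year to a platform's range:

```python
import pandas as pd

# Toy yearly sales for two platforms.
sales = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'PS4', 'PS4'],
    'year_of_release': [2000, 2005, 2011, 2013, 2016],
    'total_sales': [39.0, 160.0, 0.5, 26.0, 69.0],
})

# First and last year in which each platform appears.
lifespan = sales.groupby('platform')['year_of_release'].agg(first_year='min', last_year='max')

# Count both endpoints as active years.
lifespan['years_active'] = lifespan['last_year'] - lifespan['first_year'] + 1

print(lifespan)
```

The median or mean of `years_active` then gives a rough answer to how long a platform typically stays on the market.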

Determine what period you should take data for. To do so, look at your answers to the previous questions. The data should allow you to build a prognosis for 2017.

Based on the platforms with greatest sales and the platforms' appearance and disappearance throughout the years, I think it would be best to look at the period 2005-2016.  

Work only with the data that you've decided is relevant. Disregard the data for previous years.

In [17]:
ice_data_filtered = ice_data.query("year_of_release > 2004 and year_of_release < 2017").reset_index()
display(ice_data_filtered.head())

Unnamed: 0,index,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,total_sales
0,0,Wii Sports,Wii,2006,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E,82.54
1,2,Mario Kart Wii,Wii,2008,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E,35.52
2,3,Wii Sports Resort,Wii,2009,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E,32.77
3,6,New Super Mario Bros.,DS,2006,Platform,11.28,9.14,6.5,2.88,89.0,8.5,E,29.8
4,7,Wii Play,Wii,2006,Misc,13.96,9.18,2.93,2.84,58.0,6.6,E,28.91


Which platforms are leading in sales? Which ones are growing or shrinking? Select several potentially profitable platforms.

In [18]:
leading_sales = ice_data_filtered.groupby('platform')['total_sales'].sum().sort_values(ascending=False)
leading_sales = leading_sales.reset_index().rename(columns={0: 'platform', 'total_sales': 'total_sales'})
print(leading_sales)
sales_growth = ice_data_filtered.groupby(['year_of_release', 'platform'])['total_sales'].sum()
sales_growth = sales_growth.reset_index().rename(columns={0: 'year_of_release', 1: 'platform', 'total_sales': 'total_sales'})
sales_growth_chart = px.scatter(sales_growth,
                                x='platform',
                                y='year_of_release',
                                color='total_sales',
                                labels={
                                        'platform': 'Platform',
                                        'total_sales': 'Total Sales (in USD millions)',
                                        'year_of_release': 'Year of Release'},                              
                                        title="Platforms' Sales Growth (USD Millions)"
    
)

sales_growth_chart.show()

   platform  total_sales
0      X360       971.42
1       PS3       939.64
2       Wii       907.51
3        DS       788.83
4       PS2       438.31
5       PS4       314.14
6       PSP       286.99
7       3DS       259.00
8        PC       171.55
9      XOne       159.32
10     WiiU        82.19
11       XB        65.08
12      PSV        54.07
13      GBA        47.51
14       GC        41.05
15     2600        10.50
16       PS         3.28
17       GB         1.03
18      N64         0.67
19       DC         0.06


XOne, PS4, and 3DS led in sales in 2016.
PC and Wii sales have dropped off sharply, possibly because of new streaming options on other game consoles. Xbox 360 sales have decreased as newer Xbox versions have been released, and the same is true for PS2 and PS3 now that PS4 is out.
Based on these numbers and the graph, I would say that XOne, PS4, 3DS, and potentially PSV could be profitable.
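
Growth and decline are easier to see as year-over-year changes than as colors on a scatter plot. The sketch below pivots made-up yearly totals (the `sales` values are assumptions, not figures from the data) into a year-by-platform table and takes the difference between consecutive years:

```python
import pandas as pd

# Toy yearly totals for two platforms.
sales = pd.DataFrame({
    'year_of_release': [2014, 2015, 2016, 2014, 2015, 2016],
    'platform': ['PS4', 'PS4', 'PS4', 'X360', 'X360', 'X360'],
    'total_sales': [100.0, 120.0, 70.0, 35.0, 12.0, 1.5],
})

# One row per year, one column per platform.
yearly = sales.pivot_table(index='year_of_release', columns='platform',
                           values='total_sales', aggfunc='sum')

# Positive values mean sales grew from the previous year, negative means they shrank.
change = yearly.diff()

print(change)
```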


Build a box plot for the global sales of all games, broken down by platform. Are the differences in sales significant? What about average sales on various platforms? Describe your findings.

In [19]:
total_sales_mean = ice_data_filtered.groupby('platform')['total_sales'].mean()
total_sales_mean = total_sales_mean.reset_index().rename(columns={0: 'platform', 'total_sales': 'total_sales'})
sales_mean_box = px.box(total_sales_mean, 
                        x='total_sales', 
                        points='all', 
                        hover_data=['platform'],
                        labels={
                                'platform': 'Platform',
                                'total_sales': 'Average Sales (in USD millions)'},
                        title='Average Sales Distribution Boxplot')

sales_mean_box.show()
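
The box plot above is built from one average per platform, so it shows how the platform averages are spread rather than how individual games sell on each platform. A per-game summary by platform is sketched below on made-up rows (the `games` frame is an assumption):

```python
import pandas as pd

# Toy per-game sales: one platform has a single very large hit.
games = pd.DataFrame({
    'platform': ['Wii', 'Wii', 'Wii', 'PC', 'PC', 'PC'],
    'total_sales': [82.54, 0.5, 0.1, 1.0, 0.3, 0.2],
})

# Median, mean, and maximum sales per game for each platform.
summary = games.groupby('platform')['total_sales'].agg(['median', 'mean', 'max'])

print(summary)
```

A mean far above the median, as for the toy Wii rows, points to a few big hits pulling the average up. Passing the per-game frame to `px.box` with `x='platform'` and `y='total_sales'` would draw one box per platform for the same comparison.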


Take a look at how user and professional reviews affect sales for one popular platform (you choose). Build a scatter plot and calculate the correlation between reviews and sales. Draw conclusions.

In [42]:
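def rating_sale_comparison(dataframe, scatter_title, critic_label, sale_label, name_label):
    # Plot the score column that matches the axis label: a label containing
    # "user" selects user_score, anything else selects critic_score.
    score_column = 'user_score' if 'user' in critic_label.lower() else 'critic_score'
    critic_sale_scatter = px.scatter(dataframe,
                         x=score_column,
                         y='total_sales',
                         color='name',
                         labels={
                                 score_column: critic_label,
                                 'total_sales': sale_label,
                                 'name': name_label},
                         title=scatter_title
                         )
    critic_sale_scatter.show()
    return critic_sale_scatter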

ps4_data = ice_data[ice_data['platform']=='PS4']

rating_sale_comparison(ps4_data, 'PS4 Total Sales (in USD millions) Relative to Critic Score', 'Critic Score', 'Total Sales (in USD millions)', 'Game')
rating_sale_comparison(ps4_data, 'PS4 Total Sales (in USD millions) Relative to User Score', 'User Score', 'Total Sales (in USD millions)', 'Game')

ps4_critic_sale_corr = ps4_data['critic_score'].corr(ps4_data['total_sales'])
ps4_user_sale_corr = ps4_data['user_score'].corr(ps4_data['total_sales'])



print(ps4_critic_sale_corr)
print(ps4_user_sale_corr)

0.32400445820693186
-0.0008930190688675577


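Critic score has only a weak positive correlation with total sales on PS4 (about 0.32), and user score has essentially none (about -0.001). Neither review type is strongly correlated with total sales.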

Keeping your conclusions in mind, compare the sales of the same games on other platforms.

In [50]:
xone_data = ice_data[ice_data['platform']=='XOne']
rating_sale_comparison(xone_data, 'XOne Total Sales (in USD millions) Relative to Critic Score', 'Critic Score', 'Total Sales (in USD millions)', 'Game')
rating_sale_comparison(xone_data, 'XOne Total Sales (in USD millions) Relative to User Score', 'User Score', 'Total Sales (in USD millions)', 'Game')

xone_critic_sale_corr = xone_data['critic_score'].corr(xone_data['total_sales'])
xone_user_sale_corr = xone_data['user_score'].corr(xone_data['total_sales'])

threeds_data = ice_data[ice_data['platform']=='3DS']
rating_sale_comparison(threeds_data, '3DS Total Sales (in USD millions) Relative to Critic Score', 'Critic Score', 'Total Sales (in USD millions)', 'Game')
rating_sale_comparison(threeds_data, '3DS Total Sales (in USD millions) Relative to User Score', 'User Score', 'Total Sales (in USD millions)', 'Game')

threeds_critic_sale_corr = threeds_data['critic_score'].corr(threeds_data['total_sales'])
threeds_user_sale_corr = threeds_data['user_score'].corr(threeds_data['total_sales'])

print('XOne Critic Score Correlation: ' + str(xone_critic_sale_corr))
print('XOne User Score Correlation: ' + str(xone_user_sale_corr))
print('3DS Critic Score Correlation: ' + str(threeds_critic_sale_corr))
print('3DS User Score Correlation: ' + str(threeds_user_sale_corr))

XOne Critic Score Correlation: 0.34796494194599736
XOne User Score Correlation: -0.02174457147329062
3DS Critic Score Correlation: 0.185069813602906
3DS User Score Correlation: 0.18799652089993005
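
The same two correlations can be collected for several platforms in one loop instead of repeating the code for each platform. This is a sketch on made-up rows (the `games` frame and the platform names "A" and "B" are assumptions):

```python
import pandas as pd

# Toy data: on platform B, user score moves opposite to sales.
games = pd.DataFrame({
    'platform': ['A', 'A', 'A', 'B', 'B', 'B'],
    'critic_score': [60.0, 70.0, 80.0, 60.0, 70.0, 80.0],
    'user_score': [6.0, 7.0, 8.0, 8.0, 7.0, 6.0],
    'total_sales': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})

rows = []
for platform, group in games.groupby('platform'):
    # Pearson correlation of each score type with total sales.
    rows.append({
        'platform': platform,
        'critic_corr': group['critic_score'].corr(group['total_sales']),
        'user_corr': group['user_score'].corr(group['total_sales']),
    })

correlations = pd.DataFrame(rows).set_index('platform')
print(correlations)
```

Having the results in one table makes it easier to compare platforms side by side.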


Take a look at the general distribution of games by genre. What can we say about the most profitable genres? Can you generalize about genres with high and low sales?

In [54]:
genre_sales = ice_data_filtered.groupby('genre')['total_sales'].sum()
genre_sales = genre_sales.reset_index().rename(columns={0: 'genre', 'total_sales': 'total_sales'})

genre_sales_bar = px.bar(genre_sales,
                                x='genre',
                                y='total_sales',
                                #color='year_of_release',
                                labels={
                                        'genre': 'Genre',
                                        'total_sales': 'Total Sales (in USD millions)',
                                        },                              
                                title='Total Sales (in USD millions) by Genre')

#print(genre_sales)
genre_sales_bar.show()
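
Total sales by genre depend on how many games each genre has, so a genre can lead in total sales simply because it has more titles. The sketch below puts the total next to the game count and the median sales per game, using made-up rows (the `games` frame is an assumption):

```python
import pandas as pd

# Toy rows: Action has more games, Shooter sells more per game.
games = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Action', 'Shooter', 'Shooter'],
    'total_sales': [0.2, 0.3, 0.1, 0.4, 2.0, 1.0],
})

# Total sales, number of games, and median sales per game for each genre.
by_genre = games.groupby('genre')['total_sales'].agg(total='sum', games='count', median='median')

print(by_genre.sort_values('median', ascending=False))
```

Comparing the `total` and `median` columns helps separate genres that sell well per game from genres that are just large.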