## Introduction
> In this section, we will describe the dataset we are going to use for this assignment. Include a link to the source of this data. Also provide some explanation of why you chose this dataset.

In [1]:
import kaggle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Exploration
> Import your dataset into .ipynb, create dataframes and explore your data
> Include:
    - Summary Statistics
    - Missing value information
    - Any other relevant information about the dataset

In [2]:
kaggle.api.authenticate()
kaggle.api.dataset_download_files('ulrikthygepedersen/video-games-sales', path='./', unzip=True)
df = pd.read_csv("video_games_sales.csv")

ApiException: (401)
Reason: Unauthorized


In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isnull().sum()
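
Raw missing-value counts are easier to interpret next to the share of rows they represent. The sketch below is a minimal illustration on a small made-up frame. The values are not taken from the dataset, and `missing_summary` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages (hypothetical helper)."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": [2001.0, np.nan, 2003.0, np.nan],
    "publisher": ["X", "Y", None, "Z"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same call would be `missing_summary(df)`, assuming `df` loaded successfully.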

In [None]:
df[['platform', 'genre', 'publisher']].apply(lambda x: x.unique())

In [None]:
df["genre"].value_counts()

In [None]:
df["platform"].value_counts()

## Data Wrangling
> Perform data wrangling. You are free to use your best judgment here


In [None]:
df = df.dropna()

Subset the columns to create a separate dataframe for each region.

In [None]:
na_df = df[["name", "platform", "year", "genre", "publisher", "na_sales"]]
eu_df = df[["name", "platform", "year", "genre", "publisher", "eu_sales"]]
jp_df = df[["name", "platform", "year", "genre", "publisher", "jp_sales"]]
other_df = df[["name", "platform", "year", "genre", "publisher", "other_sales"]]

In [None]:
def update_sales_name(df, region_sales: str):
    # Return a renamed copy; renaming a column subset in place can raise SettingWithCopyWarning
    return df.rename(columns={region_sales: "sales"})

In [None]:
na_df = update_sales_name(na_df, "na_sales")
eu_df = update_sales_name(eu_df, "eu_sales")
jp_df = update_sales_name(jp_df, "jp_sales")
other_df = update_sales_name(other_df, "other_sales")
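
An alternative to keeping one dataframe per region, not used in this notebook, is a single long-format frame with a `region` column. The sketch below shows the idea with `DataFrame.melt` on made-up rows. The column names mirror the dataset's, but the values are illustrative only.

```python
import pandas as pd

# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "name": ["A", "B"],
    "genre": ["Action", "Sports"],
    "na_sales": [1.0, 2.0],
    "eu_sales": [0.5, 1.5],
})

# One row per (game, region) pair, with a shared "sales" column
long_df = toy.melt(
    id_vars=["name", "genre"],
    value_vars=["na_sales", "eu_sales"],
    var_name="region",
    value_name="sales",
)
# Turn "na_sales" into "NA", "eu_sales" into "EU"
long_df["region"] = (
    long_df["region"].str.replace("_sales", "", regex=False).str.upper()
)
print(long_df)
```

With this layout, a per-region view is a filter such as `long_df[long_df["region"] == "NA"]`, at the cost of a taller frame.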

In [None]:
def annual_best_seller(df):
    # Sort by sales, then keep the 10 best-selling titles within each year
    return df.sort_values(by="sales", ascending=False).groupby("year").head(10).reset_index(drop=True)

In [None]:
na_best_seller = annual_best_seller(na_df)
eu_best_seller = annual_best_seller(eu_df)
jp_best_seller = annual_best_seller(jp_df)
other_best_seller = annual_best_seller(other_df)
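
To see what the sort-then-`groupby().head()` pattern returns, here is a small sketch on made-up rows, keeping the top 2 per year instead of the top 10. Note that the result stays ordered by sales overall, not grouped by year.

```python
import pandas as pd

# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "year": [2000, 2000, 2000, 2001, 2001],
    "sales": [1.0, 3.0, 2.0, 5.0, 4.0],
})

top2 = (
    toy.sort_values(by="sales", ascending=False)  # best sellers first
    .groupby("year")
    .head(2)                                      # first 2 rows of each year
    .reset_index(drop=True)
)
print(top2)
```

If a year-by-year listing is wanted, a follow-up `sort_values(["year", "sales"], ascending=[True, False])` would regroup the rows.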

## Data Visualizations: Matplotlib and Seaborn
> The following plots will showcase:
>> Using and changing the legend position
>> Changing the title and x/y axis
>> Changing the marker, line colors and line width
>> Modifying axis ticks/labels
>> Changing size of axis labels


### What is the most popular genre in each region?

#### Matplotlib

In [None]:
def plot_genre_sales(df, region: str):
    genre_counts = df.groupby("genre")["sales"].sum().sort_values(ascending=True)
    plt.barh(genre_counts.index, genre_counts)
    plt.ylabel("Game Genre", fontsize = 10)
    plt.xlabel("Sales (in Millions)", fontsize = 20)
    plt.title("{}: Game Genre vs Sales".format(region))
    plt.show()

In [None]:
regions_df = {"NA": na_df, "EU": eu_df, "JP": jp_df, "Other": other_df}
for region, region_df in regions_df.items():
    plot_genre_sales(region_df, region)

#### Seaborn

In [None]:
def seaborn_plot_genre_sales(df, region: str):
    genre_counts = df.groupby("genre")["sales"].sum().sort_values(ascending=False)
    sns.barplot(x=genre_counts, y=genre_counts.index)
    plt.xlabel("Sales (in Millions)", fontsize=20)
    plt.ylabel("Game Genre", fontsize=10)
    plt.title("{}: Game Genre vs Sales".format(region))
    plt.show()

In [None]:
for region in regions_df:
    seaborn_plot_genre_sales(regions_df[region], region)

The plots above show total sales by game genre in each region: North America, Europe, Japan, and Other. Each chart was drawn twice, once with Matplotlib and once with Seaborn.

To make the plots readable, we gave each one a descriptive title with plt.title() and labeled the x- and y-axes with plt.xlabel() and plt.ylabel(), so the reader can see which quantity and unit each axis represents.

We also changed the size of the axis labels with the fontsize argument of plt.xlabel() and plt.ylabel().
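
The title, axis-label, and font-size settings described above can also be applied through Matplotlib's object-oriented interface. The sketch below is one possible version on made-up genre totals. The numbers and the chosen font sizes are illustrative assumptions, not values from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up genre totals, for illustration only
genre_sales = pd.Series({"Puzzle": 1.2, "Sports": 3.4, "Action": 5.6})

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(genre_sales.index, genre_sales.values)
ax.set_xlabel("Sales (in Millions)", fontsize=12)   # axis label text and size
ax.set_ylabel("Game Genre", fontsize=12)
ax.set_title("Toy data: Game Genre vs Sales", fontsize=14)
ax.tick_params(axis="both", labelsize=10)           # size of the tick labels
fig.tight_layout()
```

Using the same font size on both axis labels, rather than 10 on one and 20 on the other, tends to give a more balanced figure.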

### What is the general trend for annual sales?

#### Matplotlib

In [None]:
na_df.groupby('year')['sales'].sum().plot(marker = "s", linewidth = 2.5, color = "gold")
plt.title("NA Annual Sales")

plt.show()

#### Seaborn

In [None]:
sns.lineplot(data=na_df.groupby('year')['sales'].sum(), marker='s', linewidth=2.5, color='gold')
plt.title('NA Annual Sales')
plt.show()

Line Graph: Annual Sales over Time per region. Year vs Sales (in Millions)

In [None]:
sales = df.groupby('year')[["na_sales", "eu_sales", "jp_sales"]].sum()

sales.plot()
plt.xlabel("Year")
plt.ylabel("Sales (in Millions)")
plt.title("Annual Sales per Region (1980-2020)")

# Changed legend position from upper right to upper left
plt.legend(loc="upper left")
plt.show()

In [None]:
sales = df.groupby('year')[["na_sales", "eu_sales", "jp_sales"]].sum()

sns.lineplot(data = sales)
plt.title("Annual Sales per Region (1980-2020)")

# Changed legend position from upper right to upper left
plt.legend(loc="upper left")
plt.show()

In the visualizations above, we used Matplotlib and Seaborn to display annual video game sales, first for North America alone and then for North America, Europe, and Japan together. In the single-region plots, the marker, linewidth, and color arguments set the marker style, line width, and line color. In the multi-region plots, plt.legend(loc="upper left") places the legend in the upper-left corner.

Distinct marker styles make it easier to tell lines apart, while line width and color can draw attention to the most important series.
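
To illustrate how markers help separate lines in a multi-region plot, the sketch below gives each series its own marker and color and places the legend in the upper left. The yearly totals and the `styles` mapping are made up for illustration and are not part of the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals, for illustration only
sales = pd.DataFrame(
    {
        "na_sales": [10.0, 12.0, 9.0],
        "eu_sales": [6.0, 7.0, 8.0],
        "jp_sales": [4.0, 3.0, 5.0],
    },
    index=pd.Index([2006, 2007, 2008], name="year"),
)

# Hypothetical per-series styling
styles = {
    "na_sales": {"marker": "s", "color": "gold"},
    "eu_sales": {"marker": "o", "color": "tab:blue"},
    "jp_sales": {"marker": "^", "color": "tab:red"},
}

fig, ax = plt.subplots()
for column, style in styles.items():
    ax.plot(sales.index, sales[column], linewidth=2.5, label=column, **style)
ax.set_xlabel("Year")
ax.set_ylabel("Sales (in Millions)")
ax.legend(loc="upper left")
```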

### What year had the most sales globally?

In [None]:
sales_by_year = df.groupby('year')['global_sales'].sum().reset_index()

# create a bar chart of total sales by year
plt.bar(sales_by_year['year'], sales_by_year['global_sales'])

# add axis labels and a title
plt.xlabel('Year')
plt.ylabel('Global Sales (in Millions)')
plt.title('Total Global Sales by Year')

# show the plot
plt.show()

In [None]:
sales_by_year.head()

In [None]:
sns.barplot(x='year', y='global_sales', data=sales_by_year)
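# Bars sit at positions 0..n-1, so label every 5th position with its year (assumes a numeric year column)
plt.xticks(range(0, len(sales_by_year), 5), sales_by_year['year'].iloc[::5].astype(int))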

# add axis labels and a title
plt.xlabel('Year')
plt.ylabel('Global Sales (in Millions)')
plt.title('Total Global Sales by Year')

# show the plot
plt.show()


The bar charts show global video game sales by year, with 2008 having the highest total. In the Seaborn version, the x-axis tick labels were too close together to read, so we set the ticks manually at 5-year intervals. This shows how much tick formatting and labeling matter for a readable visualization.
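
When bars are drawn at integer positions (as on a categorical axis), thinning the ticks means choosing every fifth position and pairing it with the matching year label. The sketch below shows this with plain Matplotlib on made-up data. The year range and sales values are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.pyplot as plt

# Made-up data: 21 consecutive years with arbitrary totals
years = list(range(1980, 2001))
totals = [float(i % 7 + 1) for i in range(len(years))]

positions = list(range(len(years)))  # bar positions 0..n-1
fig, ax = plt.subplots()
ax.bar(positions, totals)

# Keep every 5th position and label it with the year at that position
ax.set_xticks(positions[::5])
ax.set_xticklabels([str(year) for year in years[::5]])
ax.set_xlabel("Year")
ax.set_ylabel("Global Sales (in Millions)")
```

Setting positions and labels together avoids the labels drifting out of step with the bars they describe.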

## Conclusions and Findings

Action is the most popular genre in North America, Europe, and the Other regions, while role-playing games (RPGs) are the most popular in Japan. Game developers could use this to tailor development to the preferred genre of the region they are targeting, potentially increasing sales.

Moreover, 2008 stands out as the year with the highest global sales. Gaming companies could analyze what made that year so successful and try to replicate it. It would also be interesting to check whether the most popular genre is related to the year with the most sales, which could offer insight into consumer behavior and preferences.
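
One possible starting point for the genre-versus-year question is to find the top-selling genre in each year and compare it with the best year overall. The sketch below does this on made-up rows. The column names follow the dataset, but the values are illustrative and say nothing about the real results.

```python
import pandas as pd

# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "year": [2007, 2007, 2008, 2008],
    "genre": ["Action", "Sports", "Action", "Sports"],
    "global_sales": [3.0, 4.0, 6.0, 5.0],
})

# Rows are years, columns are genres, cells are total sales
by_year_genre = (
    toy.groupby(["year", "genre"])["global_sales"].sum().unstack(fill_value=0)
)
top_genre = by_year_genre.idxmax(axis=1)       # best genre per year
best_year = by_year_genre.sum(axis=1).idxmax() # year with highest total

print(top_genre)
print("Best year:", best_year, "top genre:", top_genre[best_year])
```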