
#Setup Environment & Run Packages

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

#Load Data

In [19]:
#Read data from Google Sheets
sheet_url = 'https://docs.google.com/spreadsheets/d/17rYlro20vaBo6P2pOD6stwV5_QTztXqnfxaAzTwOUxU/edit#gid=1485085913'
#Convert the edit link into a CSV export link that pandas can read
sheet_url_trf = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df = pd.read_csv(sheet_url_trf)
df.head()

Unnamed: 0,Name,Sales,Series,Release,Genre,Developer,Publisher
0,PlayerUnknown's Battlegrounds,42.0,,12/1/2017,Battle royale,PUBG Studios,Krafton
1,Minecraft,33.0,Minecraft,11/1/2011,"Sandbox, survival",Mojang Studios,Mojang Studios
2,Diablo III,20.0,Diablo,5/1/2012,Action role-playing,Blizzard Entertainment,Blizzard Entertainment
3,Garry's Mod,20.0,,11/1/2006,Sandbox,Facepunch Studios,Valve
4,Terraria,17.2,,5/1/2011,Action-adventure,Re-Logic,Re-Logic
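
The URL rewrite used above can be wrapped in a small helper so that a link in an unexpected form fails early instead of handing an HTML page to `pd.read_csv`. This is a minimal sketch: the function name `to_csv_export_url` and the placeholder `SHEET_ID` are made up for illustration and are not part of the original notebook.

```python
def to_csv_export_url(sheet_url: str) -> str:
    """Turn a Google Sheets '/edit#gid=' link into a CSV export link."""
    marker = "/edit#gid="
    if marker not in sheet_url:
        # Fail loudly rather than silently returning the unchanged URL.
        raise ValueError(f"expected '{marker}' in the sheet URL")
    return sheet_url.replace(marker, "/export?format=csv&gid=")


# Hypothetical sheet id and gid, for illustration only.
example_url = "https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0"
csv_url = to_csv_export_url(example_url)
print(csv_url)
```

The resulting string can then be passed to `pd.read_csv` exactly as in the cell above.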


In [20]:
#Show the number of rows and columns
print("dataset has {} rows and {} columns".format(*df.shape))

dataset has 175 rows and 7 columns


#Basic steps
1. Check Variable Names

2. Check Data Type

3. Handle Missing Values

4. Check Duplicate Rows

5. Checking Summary Statistics

##1. Check Variable Names
Column names identify the data held in each column. Reviewing them first shows what kind of information each column stores, which helps us make sense of the data and interpret it correctly.


In [21]:
#Check columns
df.columns

Index(['Name', 'Sales', 'Series', 'Release', 'Genre', 'Developer',
       'Publisher'],
      dtype='object')

Now, we change the order of the columns:

In [22]:
#Change the column order
col_order = ['Name', 'Series', 'Release', 'Genre', 'Developer', 'Publisher', 'Sales']
df = df[col_order]
pd.set_option('display.max_columns', None)
df.columns

Index(['Name', 'Series', 'Release', 'Genre', 'Developer', 'Publisher',
       'Sales'],
      dtype='object')

Now, we rename the Sales column:

In [23]:
#Rename columns
df = df.rename(columns={'Sales': 'Price($)'})

##2. Check Data Type
Checking data types is an important step in data cleaning: a column stored with the wrong type (for example, dates stored as text) cannot be sorted, filtered, or aggregated reliably.

In [24]:
#Check data type
df.dtypes

Name          object
Series        object
Release       object
Genre         object
Developer     object
Publisher     object
Price($)     float64
dtype: object

Based on the data documentation, we check whether each variable has the correct data type. The output shows that Release is stored as object (text) even though it holds dates, so it needs to be converted to a datetime type.

In [25]:
#Convert Release from text to datetime
df['Release'] = pd.to_datetime(df.Release).dt.tz_localize(None)
df.dtypes

Name                 object
Series               object
Release      datetime64[ns]
Genre                object
Developer            object
Publisher            object
Price($)            float64
dtype: object
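
The sample rows suggest that Release uses a month/day/year layout (`12/1/2017` becomes `2017-12-01`). Under that assumption, a stricter variant of the conversion passes the format explicitly and uses `errors="coerce"` to surface unparseable entries as `NaT`. The values below are toy data, not taken from the sheet.

```python
import pandas as pd

# Toy values in the same month/day/year layout as the Release column,
# plus one deliberately invalid entry.
release = pd.Series(["12/1/2017", "5/1/2012", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(release, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values after the conversion is a quick way to confirm that every date was parsed.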

##3. Handle Missing Values
Checking for missing values is an important step in data cleaning because they can have a significant impact on the quality and reliability of the analysis. Missing values leave the data incomplete, which can skew the analysis and produce misleading results. By identifying and handling them appropriately, we protect the integrity and reliability of the data.

In [26]:
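#Check the amount of missing values
def nulls(df):
    #Count missing values per column
    null_values = pd.DataFrame(df.isnull().sum())
    #Share of missing values per column, as a fraction between 0 and 1
    null_values[1] = null_values[0] / len(df)
    null_values.columns = ['count','%pct']
    #Keep only columns with missing values, largest share first
    filtered_null = null_values[null_values['%pct'] > 0].sort_values(by='%pct', ascending=False)
    return filtered_null
nulls(df)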

Unnamed: 0,count,%pct
Series,36,0.205714
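
Note that the `%pct` column holds a fraction, so 0.205714 means roughly 20.6% of the 175 rows. The same report can be sketched more compactly with `isna().mean()`. The frame `toy` and the column label `fraction` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values in one column.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Series": ["S1", np.nan, np.nan, "S2"],
})

# isna().sum() counts missing values; isna().mean() gives their share (0 to 1).
null_report = pd.DataFrame({
    "count": toy.isna().sum(),
    "fraction": toy.isna().mean(),
})

# Keep only columns that actually have missing values, largest share first.
null_report = null_report[null_report["count"] > 0].sort_values("fraction", ascending=False)
print(null_report)
```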


To handle the missing values in the Series column, we fill each blank with the value from the Name column, since a game's series name often matches the name of the game itself.

In [27]:
#Fill missing Series values with the game name
df['Series'] = df['Series'].fillna(df['Name'])
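
Because the filled values are an assumption rather than observed data, it can help to record which rows were imputed before filling. The sketch below uses toy data; the flag column `Series_imputed` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Minecraft", "Terraria", "Diablo III"],
    "Series": ["Minecraft", np.nan, "Diablo"],
})

# Hypothetical helper column: remember which Series values were missing.
toy["Series_imputed"] = toy["Series"].isna()

# Assign the result back to the column rather than filling in place.
toy["Series"] = toy["Series"].fillna(toy["Name"])
print(toy)
```

The flag makes it possible to exclude the imputed rows later, for example when counting games per series.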

##4. Check Duplicate Rows
Duplicate rows can compromise the integrity of the dataset. If we have multiple identical rows, it can lead to inaccurate statistical analysis, misleading results, and duplicate entries in downstream processes. By identifying and removing duplicate rows, we ensure that the data accurately represents the underlying information.

In [28]:
#Check for duplicate data
df.duplicated().sum()

0

There are no duplicate rows.
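
`df.duplicated()` only flags rows that match in every column, so a repeat that differs by a stray space or a different spelling goes unnoticed. A minimal sketch on toy data of checking a key column as well, assuming Name is meant to identify a game uniquely:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Minecraft", "Minecraft", "Terraria"],
    # The second publisher has a trailing space, so the rows are not identical.
    "Publisher": ["Mojang Studios", "Mojang Studios ", "Re-Logic"],
})

# Full-row comparison misses the repeat.
full_row_dupes = toy.duplicated().sum()

# Comparing on the key column catches it.
name_dupes = toy.duplicated(subset=["Name"]).sum()

# Keep the first occurrence of each name.
deduped = toy.drop_duplicates(subset=["Name"], keep="first")
print(full_row_dupes, name_dupes, len(deduped))
```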

##5. Checking Summary Statistics

In [29]:
#Statistics of numerical columns
df.describe()

Unnamed: 0,Price($)
count,175.0
mean,3.141143
std,4.960513
min,1.0
25%,1.0
50%,1.5
75%,3.0
max,42.0


In [30]:
#Statistics of non-numerical columns
df.describe(include = np.object_)

Unnamed: 0,Name,Series,Genre,Developer,Publisher
count,175,175,175,175,175
unique,175,127,61,109,96
top,PlayerUnknown's Battlegrounds,Command & Conquer,Real-time strategy,Blizzard Entertainment,Electronic Arts
freq,1,5,24,8,19


The minimum of Price($) is 1.0, so the data contains no negative values.
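
Reading the minimum from `describe()` is enough here, but the same check can be made explicit so that it returns the offending rows if there are any. A small sketch on toy data, with one negative value planted on purpose:

```python
import pandas as pd

# Toy values; the last one is deliberately invalid.
toy = pd.DataFrame({"Price($)": [42.0, 33.0, 1.0, -2.0]})

# Select the rows that violate the "no negative values" expectation.
negative_rows = toy[toy["Price($)"] < 0]
print(len(negative_rows))
```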

#String Trimming and Transformation

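Text data often contains unwanted leading or trailing spaces, which can affect data integrity and analysis. Trimming these spaces ensures consistency and accuracy in subsequent operations. Converting the text to a single case likewise keeps the same value from appearing under different spellings.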

In [31]:
#Trim spaces and convert to upper case
text_cols = ['Name', 'Series', 'Genre', 'Developer', 'Publisher']
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip().str.upper())
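
In this dataset the text columns have no missing values left at this point, but in general the column-wise `.str` accessor is the safer choice because it passes `NaN` through, whereas calling `str.strip` on each element raises a `TypeError` when it meets a missing value. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["  Minecraft ", "Terraria"],
    "Genre": ["Sandbox, survival  ", np.nan],
})

text_cols = ["Name", "Genre"]

# The .str methods leave missing values untouched instead of raising.
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.strip().str.upper())
print(toy)
```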

#Export the File

In [32]:
#Show dataset
df.head()

Unnamed: 0,Name,Series,Release,Genre,Developer,Publisher,Price($)
0,PLAYERUNKNOWN'S BATTLEGROUNDS,PLAYERUNKNOWN'S BATTLEGROUNDS,2017-12-01,BATTLE ROYALE,PUBG STUDIOS,KRAFTON,42.0
1,MINECRAFT,MINECRAFT,2011-11-01,"SANDBOX, SURVIVAL",MOJANG STUDIOS,MOJANG STUDIOS,33.0
2,DIABLO III,DIABLO,2012-05-01,ACTION ROLE-PLAYING,BLIZZARD ENTERTAINMENT,BLIZZARD ENTERTAINMENT,20.0
3,GARRY'S MOD,GARRY'S MOD,2006-11-01,SANDBOX,FACEPUNCH STUDIOS,VALVE,20.0
4,TERRARIA,TERRARIA,2011-05-01,ACTION-ADVENTURE,RE-LOGIC,RE-LOGIC,17.2


In [33]:
#Show the number of rows and columns
print("dataset has {} rows and {} columns".format(*df.shape))

dataset has 175 rows and 7 columns


In [34]:
#Export data without the row index, so no extra unnamed column is written
df.to_csv('Games_Dataset_Cleaned.csv', index=False)
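
By default `to_csv` also writes the row index as an unnamed first column, and a CSV file does not store data types, so dates come back as text unless they are parsed again on load. The round trip below sketches both points on toy data, using an in-memory buffer in place of a real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Name": ["MINECRAFT", "TERRARIA"],
    "Release": pd.to_datetime(["2011-11-01", "2011-05-01"]),
    "Price($)": [33.0, 17.2],
})

# Write without the index so no extra unnamed column appears on reload.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back, asking pandas to parse Release as dates again.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Release"])
print(restored.dtypes)
```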