# <span style='color:#1ABC9C '>Data Cleaning</span>
*Data cleaning is an essential step in the data preprocessing pipeline for any data science or analytics project.
Messy, inconsistent, or missing data can lead to inaccurate insights and model predictions.* <br>
In this article, we’ll explore the fundamentals of data cleaning using Python and provide you with practical code examples.

## <span style='color:#E67E22 '>Why Cleaning Matters?</span>
Before diving into the code, let’s briefly discuss why data cleaning is crucial:

<span style='color:#BDC3C7 '>
    
1. **Accuracy:**  Clean data ensures that your analysis and machine learning models are based on accurate and reliable information.

2. **Consistency:**  Inconsistent data can lead to errors, especially when working with categorical variables, date formats, or units of measurement.
3. **Completeness:**  Missing data can cause issues in analysis and modeling. Handling missing values is an essential part of data cleaning.

</span>

## <span style='color:#E67E22 '>About the Dataset</span>
The dataset covers the top-selling action games on Steam.

**It is crucial to understand the dataset well before changing anything.**

- Name : The name of the game
- Price : Price of the game in $
- Release_date : When the game was released
- Review_no : How many reviews the game received
- Review_type : The overall review rating, one of 'Very Positive', 'Mostly Positive', 'Mixed', 'Positive', 'Overwhelmingly Positive', 'Mostly Negative', 'Very Negative', 'Overwhelmingly Negative'
- Tags : The tags given to the game, e.g. Adventure, Fantasy
- Description : The description of the game


## <span style='color:#E67E22 '>Data Quality Dimensions</span>
- **Completeness** --> is any data missing?
- **Validity** --> is the value possible at all? e.g. a negative height is invalid
- **Accuracy** --> the value is valid but wrong, e.g. an adult's height recorded as 1 m
- **Consistency** --> is the same thing written the same way everywhere? e.g. a city written as 'New York' in some rows and 'NY' in others
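As a rough illustration of the dimensions above, here is a minimal sketch on a made-up table. The `people` DataFrame and its column names are hypothetical and are not part of the Steam dataset. Completeness, validity and consistency can be checked with a few lines of pandas. Accuracy usually cannot be checked this way, because it needs an outside source of truth to compare against.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Steam dataset
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "height_m": [1.70, -1.80, None, 1.65],
    "city": ["New York", "NY", "Boston", "New York"],
})

# Completeness: count the missing values in each column
missing_per_column = people.isna().sum()

# Validity: a height must be greater than zero
invalid_heights = people[people["height_m"] <= 0]

# Consistency: map known aliases to one canonical spelling
people["city"] = people["city"].replace({"NY": "New York"})
city_values = sorted(people["city"].unique())

print(missing_per_column)
print(invalid_heights)
print(city_values)
```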

### <span style='color:#E67E22  '>Types of Assessment:</span>
1. *Manual Assessment* --> Look through the data manually in Excel or Google Sheets
2. *Programmatic Assessment* --> Use pandas functions (`info`, `describe`, etc.) to get an understanding of the data
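A programmatic assessment could look roughly like the sketch below. The three-row `sample` DataFrame is a hypothetical stand-in shaped like the Steam data, so the numbers it prints say nothing about the real file.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the Steam data
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Price": ["$59.99", "Free To Play", None],
    "Review_type": ["Very Positive", "Mixed", "Very Positive"],
})

sample.info()                          # dtypes and non-null counts
print(sample.describe(include="all"))  # count, unique, top, freq per column

# Missing values per column
null_counts = sample.isna().sum()

# Distinct labels in a categorical column, including missing ones
review_type_counts = sample["Review_type"].value_counts(dropna=False)

print(null_counts)
print(review_type_counts)
```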

**In this notebook we will go through MANUAL ASSESSMENT**

**The rest will be done when doing EDA**
## <span style='color:#E67E22  '>Issues with the Dataset:</span>

 
### <span style='color:#F39C12  '>Price</span>
- Price has $ sign
- Price has missing values
- Price has the non-numeric values 'Prepurchase' and 'Free To Play'

### <span style='color:#F39C12  '>Review_no</span>
- Review_no has the suffix 'User Reviews' (should be removed)
- Review_no has ',' between the digits
- Review_no has missing values

### <span style='color:#F39C12  '>Review_type</span>
- Review type has 'Overwhelmingly Positive','Very Positive','Mixed','Mostly Positive','Very Negative'
- Review type has Null values

### <span style='color:#F39C12  '>Release_date</span>
- Date is well structured
- Date has Null values
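The issues above were found by looking through the file by hand. A check like the one below could confirm the same kind of finding in code. This is only a sketch on a hypothetical `prices` Series that mimics the Price column: it lists every entry that is present but still not a number once the `$` sign is removed.

```python
import pandas as pd

# Hypothetical values mimicking the raw Price column
prices = pd.Series(["$59.99", None, "Free To Play", "Prepurchase", "$10.48"])

# Remove the literal '$' and surrounding whitespace
stripped = prices.str.replace("$", "", regex=False).str.strip()

# Anything that cannot be parsed as a number becomes NaN
as_number = pd.to_numeric(stripped, errors="coerce")

# Keep entries that are present but not numeric
non_numeric = stripped[as_number.isna() & stripped.notna()]
labels = non_numeric.value_counts()

print(labels)
```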


In [2]:
# Importing Libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
games = pd.read_csv('steam_uncleaned.csv',index_col=0,na_values='Prepurchase')
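To see what `na_values` does in the call above, here is a small sketch on an in-memory CSV. The `csv_text` content is hypothetical and only assumes the same layout as `steam_uncleaned.csv` (an unnamed index column first). Every cell equal to 'Prepurchase' is read as NaN, while other text such as 'Free To Play' is left alone.

```python
import io

import pandas as pd

# Hypothetical CSV text, assumed to share the layout of steam_uncleaned.csv
csv_text = """,Name,Price
0,Game A,$59.99
1,Game B,Prepurchase
2,Game C,Free To Play
"""

demo = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values="Prepurchase")
print(demo)
```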

In [5]:
games.head()

Name,Price,Release_date,Review_no,Review_type,Tags,Description
Black Myth: Wukong,$59.99,"Aug 20, 2024","574,097 User Reviews",Overwhelmingly Positive,"Mythology,Action RPG,Action,Souls-like,RPG,Com...",Black Myth: Wukong is an action RPG rooted in ...
"Warhammer 40,000: Space Marine 2",,"Sep 9, 2024","23,591 User Reviews",Very Positive,"Warhammer 40K,Action,Adventure,Third-Person Sh...",Embody the superhuman skill and brutality of a...
Counter-Strike 2,Free To Play,"Aug 21, 2012","8,286,153 User Reviews",Very Positive,"FPS,Shooter,Multiplayer,Competitive,Action,Tea...","For over two decades, Counter-Strike has offer..."
Warframe,Free To Play,,"589,527 User Reviews",Very Positive,"Free to Play,Action RPG,Looter Shooter,Third-P...",Awaken as an unstoppable warrior and battle al...
Grand Theft Auto V,$10.48,"Apr 14, 2015","1,703,156 User Reviews",Very Positive,"Open World,Action,Multiplayer,Crime,Automobile...",Grand Theft Auto V for PC offers players the o...


In [4]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7812 entries, 0 to 7811
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          7812 non-null   object
 1   Price         7761 non-null   object
 2   Release_date  7793 non-null   object
 3   Review_no     7765 non-null   object
 4   Review_type   7765 non-null   object
 5   Tags          7812 non-null   object
 6   Description   7386 non-null   object
dtypes: object(7)
memory usage: 488.2+ KB


*The columns other than 'Name' and 'Tags' have missing values*

### <span style='color:#F39C12  '>Price Column Solution</span>
- Removing the $ sign
- Setting 'Free To Play' to 0
- Setting the 'Prepurchase' value to NaN (already done through `na_values` when reading the CSV)

In [5]:
# '$' is a regex metacharacter, so match it literally
games['Price'] = games['Price'].str.replace('$','',regex=False).str.strip()

In [6]:
games['Price'] = games['Price'].str.replace('Free To Play','0')

In [7]:
games['Price'] = games['Price'].astype(np.float64).round(2)
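`astype` raises an error as soon as one value cannot be converted. If the column might contain a label we have not seen yet, a more defensive variant is `pd.to_numeric` with `errors="coerce"`, which turns such values into NaN instead. The sketch below uses a hypothetical `raw_price` Series. The 'Coming soon' entry is invented for illustration and is not known to occur in the dataset.

```python
import pandas as pd

# Hypothetical raw values; 'Coming soon' is an invented, unexpected label
raw_price = pd.Series(["$59.99", None, "Free To Play", "$10.48", "Coming soon"])

cleaned = (
    raw_price
    .str.replace("$", "", regex=False)              # drop the currency sign
    .str.replace("Free To Play", "0", regex=False)  # free games cost 0
    .str.strip()
)

# Unparseable text becomes NaN instead of raising an error
price = pd.to_numeric(cleaned, errors="coerce").round(2)
print(price)
```

The trade-off is that coercion hides surprises, so it is worth counting the NaN values before and after the conversion.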

In [8]:
games.head()

Unnamed: 0,Name,Price,Release_date,Review_no,Review_type,Tags,Description
0,Black Myth: Wukong,59.99,"Aug 20, 2024","574,097 User Reviews",Overwhelmingly Positive,"Mythology,Action RPG,Action,Souls-like,RPG,Com...",Black Myth: Wukong is an action RPG rooted in ...
1,"Warhammer 40,000: Space Marine 2",,"Sep 9, 2024","23,591 User Reviews",Very Positive,"Warhammer 40K,Action,Adventure,Third-Person Sh...",Embody the superhuman skill and brutality of a...
2,Counter-Strike 2,0.0,"Aug 21, 2012","8,286,153 User Reviews",Very Positive,"FPS,Shooter,Multiplayer,Competitive,Action,Tea...","For over two decades, Counter-Strike has offer..."
3,Warframe,0.0,,"589,527 User Reviews",Very Positive,"Free to Play,Action RPG,Looter Shooter,Third-P...",Awaken as an unstoppable warrior and battle al...
4,Grand Theft Auto V,10.48,"Apr 14, 2015","1,703,156 User Reviews",Very Positive,"Open World,Action,Multiplayer,Crime,Automobile...",Grand Theft Auto V for PC offers players the o...


### <span style='color:#F39C12  '>Release_date Column Solution</span>
- Converting the column to the datetime data type

In [9]:
games['Release_date'] = pd.to_datetime(games['Release_date'])
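The call above lets pandas guess the date format. Since the dates we saw look like 'Aug 20, 2024', the format can also be stated explicitly. The sketch below assumes every date follows the pattern `%b %d, %Y` and that month names are in English. The `raw_dates` Series is hypothetical, and 'not a date' is an invented bad value that shows what `errors="coerce"` does.

```python
import pandas as pd

# Hypothetical values; 'not a date' is an invented bad entry
raw_dates = pd.Series(["Aug 20, 2024", "Sep 9, 2024", None, "not a date"])

# Explicit format; values that do not match become NaT
release = pd.to_datetime(raw_dates, format="%b %d, %Y", errors="coerce")
print(release)
```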

In [10]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7812 entries, 0 to 7811
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Name          7812 non-null   object        
 1   Price         7761 non-null   float64       
 2   Release_date  7793 non-null   datetime64[ns]
 3   Review_no     7765 non-null   object        
 4   Review_type   7765 non-null   object        
 5   Tags          7812 non-null   object        
 6   Description   7386 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 488.2+ KB


### <span style='color:#F39C12  '>Review_no Column Solution</span>
- Removing the ',' between the digits
- Removing the 'User Reviews' suffix

In [11]:
games['Review_no'] = games['Review_no'].str.replace('User Reviews','').str.strip()

In [13]:
games['Review_no'] = games['Review_no'].str.replace(',','').str.strip()
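After these two steps the column holds digits, but they are still stored as text. If a numeric column is wanted, one option is the nullable integer type `Int64`, which keeps the missing values as `<NA>`. This conversion is an optional extra step that the notebook itself does not perform. The `raw_reviews` Series below is a hypothetical stand-in for the column.

```python
import pandas as pd

# Hypothetical stand-in for the raw Review_no column
raw_reviews = pd.Series(["574,097 User Reviews", "23,591 User Reviews", None])

digits = (
    raw_reviews
    .str.replace("User Reviews", "", regex=False)  # drop the suffix
    .str.replace(",", "", regex=False)             # drop the separators
    .str.strip()
)

# Nullable integer dtype keeps missing values without falling back to float
review_no = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(review_no)
```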

In [14]:
games.head()

Unnamed: 0,Name,Price,Release_date,Review_no,Review_type,Tags,Description
0,Black Myth: Wukong,59.99,2024-08-20,574097,Overwhelmingly Positive,"Mythology,Action RPG,Action,Souls-like,RPG,Com...",Black Myth: Wukong is an action RPG rooted in ...
1,"Warhammer 40,000: Space Marine 2",,2024-09-09,23591,Very Positive,"Warhammer 40K,Action,Adventure,Third-Person Sh...",Embody the superhuman skill and brutality of a...
2,Counter-Strike 2,0.0,2012-08-21,8286153,Very Positive,"FPS,Shooter,Multiplayer,Competitive,Action,Tea...","For over two decades, Counter-Strike has offer..."
3,Warframe,0.0,NaT,589527,Very Positive,"Free to Play,Action RPG,Looter Shooter,Third-P...",Awaken as an unstoppable warrior and battle al...
4,Grand Theft Auto V,10.48,2015-04-14,1703156,Very Positive,"Open World,Action,Multiplayer,Crime,Automobile...",Grand Theft Auto V for PC offers players the o...


In [21]:
games.to_csv('steam_cleaned.csv',index=False)
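One thing to keep in mind: a CSV file does not store data types, so the dates come back as plain text unless we ask for them to be parsed. The sketch below shows the round trip on a hypothetical two-row frame, written to an in-memory buffer instead of `steam_cleaned.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
cleaned = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Price": [59.99, None],
    "Release_date": pd.to_datetime(["2024-08-20", None]),
})

# Write without the index, as in the cell above, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on reload
reloaded = pd.read_csv(buffer, parse_dates=["Release_date"])
print(reloaded.dtypes)
```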

### <span style='color:#1ABC9C '>Conclusion</span>
<span style='color:#BDC3C7 '>
    
- The missing values cannot be filled until we have done EDA and understood the dataset.
- The 'Tags' column needs some work, but first we need to do EDA.
- There is not much more to do with this dataset. Only 'Tags' and 'Description' can be transformed, but we will do that in the EDA notebook.
</span>