# DS Workshop Day 1 : Dealing with Data 


## Welcome to this data science workshop by [GeeksHub](https://www.facebook.com/GeeksHUB.eg)!
(Check out our page for more details.) 👀



### Our problem : predicting the rating of apps from the [Google Play Store Apps Dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps/code?datasetId=49864&language=Python&outputs=Visualization&tagIds=13201%2C16614)

### Day 1️⃣: Tips and Tricks for Data Preparation and Exploratory Data Analysis (1.5 – 2 hours)

* Set the stage for an exciting data science journey.

* Advanced Data Cleaning with Pandas.

* Techniques for handling missing data.

* Removing outliers and anomalies.

* Exploratory Data Analysis (EDA) with Matplotlib.

* Advanced plotting and visualization.

* Extracting insights from data.

### Day 2️⃣: Machine Learning Review (2 hours)

* Model training and selection.

* Understanding model evaluation and performance metrics.

* Selecting the best algorithm for a task.

* Extra: Hyperparameter Tuning.

* Extra: Optimizing model performance.

* Practical: Apply these concepts on your selected dataset (1.5 hours)


### Day 3️⃣: Finalize our Project. Open discussion about the most common Technical Issues.

Our instructors will guide you through each topic, and you'll have the opportunity to apply your learning to real-world datasets, gaining valuable practical experience.




*This workshop is reviewed and supervised by Eng. Ahmed Abdelmalek – Senior NLP Engineer @WideBot - [LinkedIn](https://www.linkedin.com/in/ahmed-abdelmalek/)*

*And taught by:*

*Mustafa Osama, NLP Engineer @WideBot - [LinkedIn](https://www.linkedin.com/in/mustafa-osama-164254232/)*

*Abdelrahman Mohamed, Clinical Data Analyst and Co-founder of GeeksHub - [LinkedIn](https://www.linkedin.com/in/abdelrahman-mohamed-%F0%9F%87%B5%F0%9F%87%B8-210ab81b7/)*


# Day 1 : Tips and Tricks for Data Preparation and Exploratory Data Analysis 

![meme1](meme.jpeg)



**Remember to check the dataset description at the link provided to better understand the data we will be working with. For the lazy ones, here's a quick description of the dataset.** 👀



*The dataset was produced by scraping the Google Play Store and contains the following information about each app:*
    
   * App (name)
   * Category
   * Rating
   * Reviews
   * Size
   * Installs (number of installations)
   * Type (free or paid)
   * Price
   * Content Rating (the age group the app is appropriate for)
   * Genres (an app can have more than one genre)
   * Last Updated
   * Current Ver
   * Android Ver
   
   
   

### What we will do on Day 1, in a nutshell:
1. Cleaning the data (one problem at a time)
2. Exploring relationships in the data (exploratory data analysis)
3. Creating more meaningful visualisations (explanatory data analysis)

**The notebook alternates between common-knowledge techniques and advanced methods, with practice exercises for the students to complete themselves.**

In [9]:
## imports  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
from pylab import rcParams 
import statsmodels.api as sm

plt.style.use('default')


## 1. Cleaning the data : One Problem at a Time
### Essential questions to ask:
* Are there duplicates?
* Are there data-entry anomalies?
* Are the features stored in the appropriate data type?
* Are there missing values?
* Is the formatting of individual features uniform?
* Are there meaningless features, and what more meaningful features can we extract?
* Can we reduce the cardinality of high-cardinality variables? (An ML-oriented question; also, which features should we include in our ML model?)

### 1.0 General exploration of the data

In [None]:
## read the data 
data=pd.read_csv("googleplaystore.csv")
data.head()

In [None]:
print("There are {} observations and {} features in this dataset. \n".format(data.shape[0],data.shape[1]))
                                                                        

In [None]:
data.info()

*Observations:*
* There are missing values in the Rating column.
* The Reviews and Price columns should be numerical rather than object.
* Last Updated could be a datetime type instead of object.

In [None]:
# A statistical summary for quantitative data
data.describe()

*Observation:* the maximum Rating is 19, an obvious outlier because Play Store ratings only range from 1 to 5.

In [None]:
# The number of unique Applications
data['App'].nunique()

In [None]:
data.duplicated().sum()

*Observation:* there are duplicates, as some apps appear to have been scraped twice.
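To see why the row count and the number of unique app names can differ, it may help to compare fully identical rows with rows that only share an app name. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C"],
    "Reviews": ["10", "10", "5", "7", "3"],
})

# Rows that are exact copies of an earlier row
exact_dupes = toy.duplicated().sum()

# Rows whose app name already appeared, even if other columns differ
name_dupes = toy.duplicated(subset="App").sum()

print(exact_dupes, name_dupes)
```

Whether same-name rows with different values should also be dropped is a judgment call that depends on which copy you trust.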

In [None]:
data.dtypes

In [None]:
# For loop to find the Statistics of each Column and its Type.

for i in list(data.columns):
    
    print("\n ************ "+i+" ************\n")
    print("\n",data[i].value_counts())
    print("\n",data[i].describe(),"\n")

*observations:*


        (write your answer here)

### 1.1 Basic Cleaning 

In [None]:
### duplicates 
data.drop_duplicates(inplace=True)
data.duplicated().sum()

In [None]:
## correct wrongly entered values in Rating, Installs, Category, and Type
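One possible way to approach this cell, assuming the anomalies show up as ratings outside the 1 to 5 range (like the 19 seen in `describe()`). The frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Toy frame with one clearly broken row (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, 3.8],
    "Type": ["Free", "0", "Paid"],
})

# Rows whose rating falls outside the valid 1-5 range
bad_rows = toy[(toy["Rating"] < 1) | (toy["Rating"] > 5)]
print(bad_rows)

# Drop them by index label and check that the other columns now look sane
toy_clean = toy.drop(index=bad_rows.index)
print(toy_clean["Type"].value_counts())
```

Rows with a missing rating are left untouched here, because comparisons with NaN evaluate to False.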




In [None]:
print(data.Installs.value_counts(),"\n")
print(data.Category.value_counts(),"\n")
print(data.Type.value_counts(),"\n")

In [None]:
## datatypes 
## correcting Reviews, Price and Last Updated
data.loc[:,["Reviews","Price","Last Updated"]]

In [None]:
## correct the types in this cell
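A minimal sketch of one way to do the conversion, assuming Reviews holds digits stored as text, Price carries a leading `$`, and Last Updated looks like `January 7, 2018`. The `toy` frame is made up; the real columns may contain extra surprises that need handling first:

```python
import pandas as pd

# Toy frame mimicking the assumed raw string formats
toy = pd.DataFrame({
    "Reviews": ["159", "967", "87510"],
    "Price": ["0", "$4.99", "$1.49"],
    "Last Updated": ["January 7, 2018", "August 1, 2018", "June 20, 2018"],
})

# Reviews: digits stored as text -> integers
toy["Reviews"] = toy["Reviews"].astype(int)

# Price: strip the dollar sign, then convert to float
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)

# Last Updated: parse "Month day, Year" strings into datetimes
toy["Last Updated"] = pd.to_datetime(toy["Last Updated"], format="%B %d, %Y")

print(toy.dtypes)
```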

In [None]:
data.dtypes

In [None]:
data.loc[:,["Reviews","Price","Last Updated"]]

In [None]:
## missing values
data.isna().sum()

In [None]:
## Rating is our main target variable, so the unlabeled rows must be removed
data.dropna(subset=["Rating"],inplace=True)

In [None]:
## for the other columns we can use mode imputation
data["Current Ver"]=data["Current Ver"].fillna(data["Current Ver"].mode())
data["Android Ver"]=data["Android Ver"].fillna(data["Android Ver"].mode())
data["Type"].fillna(data["Type"].mode(),inplace=True)

In [None]:
data.isna().sum()

In [None]:
## Why didn't fillna work?
## maybe the cells are empty but not saved as nan?
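Another likely explanation worth checking: `mode()` returns a Series rather than a single value, and `fillna` with a Series aligns on index labels. The sketch below reproduces this on a made-up Series `s` (not the workshop data):

```python
import pandas as pd

# Toy column with one missing value (made-up values)
s = pd.Series(["4.1", None, "4.1", "5.0"])

# mode() returns a Series (there can be ties), indexed 0, 1, ...
print(s.mode())

# Passing that Series to fillna aligns on index labels,
# so only a missing value at label 0 would be filled
still_missing = s.fillna(s.mode()).isna().sum()

# Taking the first mode gives a scalar that fills every gap
filled = s.fillna(s.mode().iloc[0])

print(still_missing, filled.isna().sum())
```

If this is the cause, using `.mode().iloc[0]` (or `.mode()[0]`) in the imputation cell should make the missing counts drop to zero.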


In [None]:
data.isna().sum()

### 1.2 Some Thoughtful Cleaning

In [None]:
## are variables saved in a uniform format ?
data.head()

In [None]:
data["Content Rating"].value_counts()

In [None]:
## how do you think we can improve this ?



In [None]:
## more meaningful presentation of last updated column
data["Last Updated Year"] = 
data["Last Updated Season"] = 
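One way the two blanks could be filled, sketched on a made-up frame and assuming Last Updated has already been converted to datetime. The month-to-season mapping is an assumption (northern-hemisphere meteorological seasons); pick whatever convention suits your analysis:

```python
import pandas as pd

# Toy frame with an already-parsed datetime column (made-up dates)
toy = pd.DataFrame({
    "Last Updated": pd.to_datetime(
        ["2018-01-07", "2018-04-15", "2017-08-01", "2016-11-30"]
    )
})

toy["Last Updated Year"] = toy["Last Updated"].dt.year

# Assumed season convention, keyed by month number
month_to_season = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}
toy["Last Updated Season"] = toy["Last Updated"].dt.month.map(month_to_season)

print(toy)
```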


In [None]:
# Converting the column "Size" to float
# Sizes appear in MB, in KB, as numbers without a unit, and as "Varies with device"

# Removing "M", which stands for MB in the size
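A possible sketch of the conversion, expressing everything in megabytes and using -1 as the placeholder for unknown sizes. The helper `size_to_mb`, the 1024 conversion factor, and the toy values are assumptions made for illustration:

```python
import pandas as pd

# Toy column covering the formats described above (made-up values)
toy = pd.DataFrame({"Size": ["19M", "8.7M", "201k", "Varies with device"]})

def size_to_mb(value):
    """Convert a size string to megabytes; -1 marks an unknown size."""
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024  # assumption: 1 MB = 1024 KB
    try:
        return float(value)  # plain number without a unit, kept as is
    except ValueError:
        return -1.0  # e.g. "Varies with device"

toy["Size"] = toy["Size"].map(size_to_mb)
print(toy)
```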


In [None]:
sb.displot(data.Size)

In [None]:
### -1 is just like NaN: it indicates a missing value, hence we need to perform imputation
### as this variable is MNAR and more than 5% is missing, we will use end-of-distribution imputation
## check Missing Values.ipynb for reference
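The referenced notebook is not shown here, so the exact rule is an assumption. A common form of end-of-distribution imputation fills missing values with the mean plus three standard deviations of the observed values, which can be sketched on a made-up Series like this:

```python
import numpy as np
import pandas as pd

# Toy sizes in MB, with -1 standing for "unknown" (made-up values)
size = pd.Series([19.0, 8.7, -1.0, 25.0, -1.0, 14.0])

# Treat the -1 placeholder as a real missing value
size = size.replace(-1.0, np.nan)

# Assumed end-of-distribution value: mean + 3 standard deviations
# of the observed (non-missing) sizes
fill_value = size.mean() + 3 * size.std()
size_imputed = size.fillna(fill_value)

print(round(fill_value, 2))
print(size_imputed)
```

Because every missing entry receives the same extreme value, a visible spike at the far right of the distribution plot is expected.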


In [None]:
sb.displot(data.Size)

The distribution was distorted, so we might consider removing this column altogether when experimenting in the ML section.

In [None]:
# To convert the column "Installs" into float
# So, firstly remove the "+"
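A minimal sketch of the two cleaning steps on a made-up column, assuming the values look like `10,000+` (a trailing plus sign and thousands separators):

```python
import pandas as pd

# Toy column in the assumed raw format
toy = pd.DataFrame({"Installs": ["10,000+", "500,000+", "1,000,000+", "50+"]})

toy["Installs"] = (
    toy["Installs"]
    .str.replace("+", "", regex=False)  # drop the trailing plus sign
    .str.replace(",", "", regex=False)  # drop thousands separators
    .astype(float)
)

print(toy["Installs"])
```

Passing `regex=False` matters here because `+` is a special character in regular expressions.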


In [None]:
## let us try something
example=data['Genres'].str.get_dummies(sep=";")
example

In [None]:
# which will lead to higher-dimensional data: one-hot encoding Genres as above, or encoding the Category column?


It appears that dealing with Category as is would be better in terms of dimensionality.
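A quick way to check this kind of claim is to count the columns each encoding would produce. The sketch below does so on a made-up frame; on the real data, the same two expressions can be run against `data`:

```python
import pandas as pd

# Toy frame with multi-genre entries separated by ";" (made-up values)
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "TOOLS"],
    "Genres": [
        "Action",
        "Action;Action & Adventure",
        "Education;Pretend Play",
        "Tools",
    ],
})

# One column per distinct genre after splitting on ";"
genre_columns = toy["Genres"].str.get_dummies(sep=";").shape[1]

# One column per distinct category if Category were one-hot encoded
category_columns = toy["Category"].nunique()

print(genre_columns, category_columns)
```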

In [None]:
#def lower_cardinality(col,cut_off):
## create a function that lowers the cardinality of a variable given a cut-off frequency,
## where any value that occurs fewer times than this cut-off is replaced by "OTHER"
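One possible implementation of the function described above. It keeps the `(col, cut_off)` signature used later in the notebook; the optional `df` argument is an addition made here so the function can be tried on a toy frame, and it falls back to the notebook's `data` when omitted:

```python
import pandas as pd

def lower_cardinality(col, cut_off, df=None):
    """Replace values of df[col] seen fewer than cut_off times with "OTHER"."""
    if df is None:
        df = data  # the workshop DataFrame loaded earlier in the notebook
    counts = df[col].value_counts()
    rare_values = counts[counts < cut_off].index
    # Modify the frame in place so later cells see the reduced categories
    df.loc[df[col].isin(rare_values), col] = "OTHER"

# Toy demonstration (made-up values)
toy = pd.DataFrame({
    "Category": ["FAMILY"] * 4 + ["GAME"] * 3 + ["COMICS"] + ["BEAUTY"]
})
lower_cardinality("Category", cut_off=2, df=toy)
print(toy["Category"].value_counts())
```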

In [None]:
data["Category"].value_counts()

In [None]:
lower_cardinality("Category",cut_off=300)
data["Category"].value_counts()

"Current Ver" and "Andriod Ver" seems to will have negative effect more than positive however there are a few ideas feel free to try them on your own:
1. creating a boolean variable for version being more than 1 
2. getting only the first number of the version rather than entire sequence

### 1.3 Visualisation

In [None]:
# Task abstraction: show the count of each type of application (free or paid)
# Annotate the count on each bar
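A minimal sketch of an annotated count plot using plain Matplotlib on made-up data. `Axes.bar_label` is assumed to be available (it was added in Matplotlib 3.4); the labels and title are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy column of app types (made-up values)
toy = pd.DataFrame({"Type": ["Free", "Free", "Paid", "Free", "Paid", "Free"]})

type_counts = toy["Type"].value_counts()

fig, ax = plt.subplots()
bars = ax.bar(type_counts.index, type_counts.values)
ax.bar_label(bars)  # write each count above its bar
ax.set_xlabel("Type")
ax.set_ylabel("Number of apps")
ax.set_title("Free vs. paid apps")
plt.show()
```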



In [None]:
## create a visual to represent the most common genres from Genres column 

In [None]:
# Task abstraction: show the ratings of the five most popular applications and show their categories.
# 1. Sort the dataset in descending order by the number of installations.
# 2. Extract the top N (5).
# 3. Construct the chart with the application name on X and the Rating of the top 5 on Y.
# 4. Use color as a third variable for the Categories.
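Steps 1 and 2 can be sketched in pandas as follows, on a made-up frame and assuming Installs has already been converted to a number:

```python
import pandas as pd

# Toy frame with numeric installs (made-up values)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E", "F"],
    "Category": ["GAME", "TOOLS", "GAME", "SOCIAL", "FAMILY", "TOOLS"],
    "Rating": [4.5, 4.1, 3.9, 4.7, 4.0, 4.3],
    "Installs": [1e6, 5e8, 1e3, 1e9, 5e5, 1e7],
})

# Sort by installs, highest first, and keep the first five rows
top5 = toy.sort_values("Installs", ascending=False).head(5)

print(top5[["App", "Category", "Rating", "Installs"]])
```

The resulting frame can then be handed to a plotting call, for example `px.bar(top5, x="App", y="Rating", color="Category")` with Plotly Express. On the real data many apps share the same install bucket, so ties are likely and the choice of five may be somewhat arbitrary.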



In [None]:
# Task abstraction: show the number of reviews for the five most popular applications.
# Visual encoding by human interaction: notice... Is it reasonable to decide, based on
# one or two of the five applications, that they are the best ones for stakeholders to invest in?
# 1. Show the number (percentage) of Reviews as pie sectors labelled with the names of the top 5 applications.



In [None]:
# Treemap

# Task abstraction: decide which combinations of content rating and category are the best for the stakeholders, or
# reflect on what the trends are in the mobile applications industry.

# 0. Group the dataset by Content Rating and Category and aggregate the ratings as means.
# 1. Sort this group by Rating in descending order.
# 2. Extract the top N (20).
# 3. Construct the map with Content Rating as the parent and Category as the child.
# 4. Use color as a third variable, as a grading scale for the Rating values.

# Reset the index to turn the grouped result (Graph3) into a DataFrame
reset_indx1 = Graph3.reset_index()
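The cell above uses `Graph3` without defining it. One possible way to build it, following steps 0 to 2, is sketched below on a made-up frame (how the workshop actually constructs `Graph3` is not shown, so treat this as an assumption):

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen", "Teen", "Everyone"],
    "Category": ["GAME", "GAME", "GAME", "SOCIAL", "TOOLS"],
    "Rating": [4.0, 5.0, 3.0, 4.6, 4.2],
})

# Mean rating per (Content Rating, Category) pair, best first, top 20
Graph3 = (
    toy.groupby(["Content Rating", "Category"])["Rating"]
    .mean()
    .sort_values(ascending=False)
    .head(20)
)

# Turn the two index levels back into ordinary columns
reset_indx1 = Graph3.reset_index()
print(reset_indx1)
```

A frame in this shape is what `px.treemap` expects when given `path=["Content Rating", "Category"]` and `color="Rating"`.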

In [None]:
# Boxplot of 'Rating' variable


In [None]:
# Rating vs Count Bar Plot
# Sorted descendingly



In [None]:
# Create a heatmap for correlations
plt.show()
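A minimal sketch of a correlation heatmap on made-up data, using only pandas and Matplotlib. It assumes the numeric columns have already been cleaned; with seaborn, `sb.heatmap(corr, annot=True)` would be a shorter alternative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with a mix of numeric and text columns (made-up values)
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3],
    "Reviews": [159, 967, 87510, 215644, 5000],
    "Installs": [1e4, 5e5, 5e6, 5e7, 1e5],
    "Type": ["Free", "Free", "Free", "Free", "Paid"],
})

# Correlations are only defined for the numeric columns
corr = toy.select_dtypes("number").corr()

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson correlation")
plt.show()
```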

In [None]:
data.to_csv("data_cleaned.csv", index= False)