# **DATA SCIENCE CHEATSHEET WITH COVID19 DATASET**
Skills Needed for Data Science
1. Basic Tools: Python, R, SQL etc.
2. Basic Statistics: Such as mean, median, standard deviation etc.
3. Data Munging: Correcting messy and difficult data.
4. Data Visualization: Visualizing data with matplotlib, seaborn etc.
5. Machine Learning: The math behind it and how to implement it.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #visualization tool
import seaborn as sns #visualization tool

In [None]:
df=pd.read_csv("../input/novel-corona-virus-2019-dataset/covid_19_data.csv")#to import covid_19_data
df.info()#to see basic information about the data frame

In [None]:
df.head(10)#to see the first 10 rows of the data frame

In [None]:
df.columns#to see the columns

In [None]:
#Correlation Map
df.corr(numeric_only=True)#to see the correlation between numeric columns as a table (numeric_only skips text columns)

In [None]:
sns.heatmap(df.corr(numeric_only=True),annot=True,linewidths=5,fmt=".3f")#heatmap version of the correlation table
plt.show()
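
To make the correlation table easier to read, here is a minimal sketch on a tiny invented frame (the column names and values are assumptions for illustration, not taken from the COVID-19 dataset). It selects the numeric columns first, which should give the same result as passing `numeric_only=True`.

```python
import pandas as pd

# Hypothetical toy frame (values invented for illustration, not from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "confirmed": [10, 20, 30, 40],
    "deaths": [1, 2, 3, 4],
    "recovered": [8, 6, 4, 2],
})

# Keep only numeric columns, then compute pairwise Pearson correlations between columns
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))
```

Here `deaths` rises in step with `confirmed` (correlation close to 1) while `recovered` falls in step with it (correlation close to -1).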

# **PYTHON**
 **1.MATPLOTLIB**

Basic Python library for plotting lines, scatters and histograms. It allows us to customize the plots (colors, labels, opacity, grid, figsize, ticks etc.).
1. Line plots are better when x is time.
2. Scatter plots are better when there is a correlation between variables.
3. Histograms are better for seeing the distribution of numerical data.


In [None]:
#Line Plot
df.Deaths.plot(kind="line",color="red",label="Deaths",grid=True,alpha=1)#alpha sets the opacity; the other arguments are self-explanatory
df.Confirmed.plot(color="blue",label="Confirmed",alpha=0.5)
plt.legend(loc="upper left")#to show the legend on the plot. loc=location
plt.xlabel("Time")#to label the x axis
plt.ylabel("Number of Confirmed Cases and Deaths")
plt.title("Line Plot of Confirmed Cases and Deaths")
plt.show()

In [None]:
#Scatter Plot
df.plot(kind="scatter",x="Deaths",y="Recovered",alpha=0.1,color="blue")#not a good example of correlation, but it reveals some errors in the dataset
plt.xlabel("Number of Deaths")
plt.ylabel("Recovered Cases")
plt.title("Correlation Between Number of Deaths and Recovered Cases")
plt.show()


In [None]:
#Histogram
df.Deaths.plot(kind="hist",bins=50,range=(0,250),density=True)#bins sets the number of bars; density=True normalizes the bars so their total area is 1
plt.xlabel("Deaths")
plt.ylabel("Density")
plt.title("Distribution of the Number of Deaths")
plt.show()

In [None]:
#Subplots
df.plot(subplots=True)
plt.show()

In [None]:
#Histogram Subplots
fig,axes=plt.subplots(nrows=2,ncols=1)
df.plot(kind="hist",y="Confirmed",bins=50,range=(0,250),ax=axes[0])
df.plot(kind="hist",y="Confirmed",bins=50,range=(0,250),ax=axes[1],cumulative=True)
plt.savefig("graph.png")
plt.show()

**2.PANDAS**
* Fast and efficient for data frames
* Operations with files are easy (csv, text etc.)
* Handling missing and messy data
* Reshaping data for more efficient usage
* Slicing and indexing data
* Analysing time series data

In [None]:
#Filtering Data Frame with Pandas
x=(df["Confirmed"]<50000)&(df["Deaths"]>10000)#to see the rows where confirmed cases are fewer than 50000 and deaths are more than 10000
df[x]

# CLEANING DATA
**DIAGNOSE DATA FOR CLEANING**
Before exploring data, we need to diagnose and clean it.

Unclean Data:
* Column name inconsistency, such as upper/lower case letters or space usage
* Missing data
* Different languages

To diagnose, use head(), tail(), columns, shape and info() (columns and shape are attributes, so they take no parentheses).

In [None]:
df.head()#to see first 5 rows

In [None]:
df.tail()#to see last 5 rows

In [None]:
df.shape#to see the number of the rows and columns

In [None]:
df.info()#to see the data types in data set.

**EXPLORATORY DATA ANALYSIS(EDA)**
value_counts(): frequency counts

outliers: values that are considerably higher or lower than the rest of the data

describe() method includes:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%, 50%, 75%: quartiles (50% is the median)
* max: maximum entry

In [None]:
print(df["Country/Region"].value_counts(dropna=False))#dropna=False:to also count NaN values

In [None]:
df.describe()#ignores NaN entries
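
As a small check of how `value_counts(dropna=False)` and `describe()` treat missing entries, here is a sketch on an invented frame (names and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing region and one missing count
toy = pd.DataFrame({
    "region": ["Hubei", "Hubei", np.nan, "Lombardy"],
    "confirmed": [10.0, 30.0, np.nan, 20.0],
})

with_nan = toy["region"].value_counts(dropna=False)  # NaN gets its own row
without_nan = toy["region"].value_counts()           # NaN is skipped by default

summary = toy["confirmed"].describe()                # NaN is ignored in every statistic
print(with_nan)
print(summary)
```

The `count` reported by `describe()` is the number of non-missing entries, so it can be smaller than the number of rows.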

# VISUAL EXPLORATORY DATA ANALYSIS

Box plots: visualize basic statistics like outliers, min/max or quantiles

In [None]:
df1=df.head(13)#since the data set is huge, let's take the first 13 entries to see the boxplot clearly
df1.boxplot(column="Confirmed",by="Country/Region")
#from top to bottom, the horizontal lines are: upper whisker, 75%, median, 25%, lower whisker
#the whiskers reach the most extreme values that are not outliers; the circles are outliers
plt.show()
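
The circles in a box plot can also be found numerically. The sketch below uses the common 1.5 × IQR convention (the default whisker length in matplotlib box plots) on invented values; the threshold is a convention, not the only possible definition of an outlier.

```python
import pandas as pd

# Hypothetical values with one obviously extreme entry
values = pd.Series([3, 4, 5, 5, 6, 7, 120])

q1 = values.quantile(0.25)   # first quartile (25%)
q3 = values.quantile(0.75)   # third quartile (75%)
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # anything below this is flagged
upper = q3 + 1.5 * iqr       # anything above this is flagged

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```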

**TIDY DATA**

To tidy data we use melt():
* melt() unpivots a data frame from wide format to long format, optionally leaving identifier variables in place

In [None]:
#let's use df1 to see melt()
df1=df.head()
df1

In [None]:
melted=pd.melt(frame=df1,id_vars="SNo",value_vars=["Confirmed","Deaths"])
#id_vars =what we don't want to melt
#value_vars=what we want to melt
melted

**PIVOTING DATA**

pivot() is simply the reverse of melt().

In [None]:
melted.pivot(index="SNo",columns="variable",values="value")
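
To see that pivot() undoes melt(), here is a minimal round-trip sketch on an invented three-row frame (the values are assumptions; the column names mirror the dataset).

```python
import pandas as pd

# Hypothetical wide-format frame
wide = pd.DataFrame({"SNo": [1, 2, 3], "Confirmed": [10, 20, 30], "Deaths": [1, 2, 3]})

# Wide -> long: one row per (SNo, variable) pair
long = pd.melt(wide, id_vars="SNo", value_vars=["Confirmed", "Deaths"])

# Long -> wide again: variable names become columns, then SNo returns as a column
restored = long.pivot(index="SNo", columns="variable", values="value").reset_index()
restored.columns.name = None  # drop the leftover "variable" label on the columns

print(long)
print(restored)
```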

**CONCATENATING DATA**

In [None]:
#vertical concatenation
data1=df.head()
data2=df.tail()
#now we can concatenate the two data frames
conc_data_row=pd.concat([data1,data2],axis=0,ignore_index=True)
#axis=0: to concat vertically
conc_data_row

In [None]:
#horizontal concatenation
data1=df["Confirmed"].head()
data2=df["Deaths"].head()
conc_data_col=pd.concat([data1,data2],axis=1)
#axis=1: to concat horizontally
conc_data_col

**DATA TYPES**

object(string),boolean,integer,float,categorical.

In [None]:
df.dtypes#to see the data types in our data frame

In [None]:
#to convert data
df["Province/State"]=df["Province/State"].astype("category")#from object to categorical
df.dtypes

# MISSING DATA AND TESTING WITH AN ASSERT

What to do when there is missing data:
* leave it as it is
* dropna() to drop the missing values
* fillna() to fill the missing values with a chosen value
* fill the missing values with summary statistics like the mean

In [None]:
df["Province/State"].value_counts(dropna=False)#dropna=False:to see NaN values

In [None]:
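province=df["Province/State"].dropna()#dropna() on a selected column returns a new Series, so we keep the result; with inplace=True the change does not reliably reach df

In [None]:
assert province.notnull().all()#passes silently because province has no NaN values
#the original assert on df["Province/State"] failed because df itself still contained the NaN values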
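
The options listed above can be compared on a small invented frame. This is a sketch under assumed values; the "Unknown" placeholder and the choice of the mean are illustrative, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing state and a missing count
toy = pd.DataFrame({
    "state": ["Hubei", np.nan, "Ontario"],
    "confirmed": [10.0, np.nan, 30.0],
})

# Option 1: drop the rows where "state" is missing (returns a new frame)
dropped = toy.dropna(subset=["state"])

# Option 2: fill missing values, text with a placeholder, numbers with the column mean
filled = toy.copy()
filled["state"] = filled["state"].fillna("Unknown")
filled["confirmed"] = filled["confirmed"].fillna(filled["confirmed"].mean())

# assert raises an error if the condition is False, and does nothing otherwise
assert dropped["state"].notnull().all()
assert filled.notnull().all().all()
```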

# BUILDING DATA FRAMES FROM SCRATCH

* We can build data frames from dictionaries.
    * zip(): This function returns an iterator of tuples, where the i'th tuple contains the i'th element from each of the argument sequences or iterables. Wrap it in list() or dict() to materialize it.
    * Broadcasting: Creating a new column and assigning one value to the entire column

In [None]:
#Creating data frames from dictionary
country=["Turkey","Germany"]
population=["80","55"]
list_label=["country","population"]
list_col=[country,population]
zipped=list(zip(list_label,list_col))
data_dict=dict(zipped)
df1=pd.DataFrame(data_dict)
df1

In [None]:
#Adding new columns
df1["Capital"]=["Ankara","Berlin"]
df1

In [None]:
#Broadcasting
df1["income"]=0
df1
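
A compact sketch of the same idea, showing that zip() yields its pairs only once and that a scalar assignment fills a whole column. The population numbers are placeholder values for illustration.

```python
import pandas as pd

labels = ["country", "population"]
columns = [["Turkey", "Germany"], [80, 55]]  # placeholder values

pairs = zip(labels, columns)   # a lazy iterator of (label, column) tuples
data_dict = dict(pairs)        # consuming the iterator builds the dictionary
leftover = list(pairs)         # the iterator is now exhausted, so this is []

frame = pd.DataFrame(data_dict)
frame["income"] = 0            # broadcasting: one scalar fills the entire column
print(frame)
```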

# INDEXING PANDAS TIME SERIES

Dates are read as object (string) by default. parse_dates (an argument of read_csv): parses dates into datetime objects, which are displayed in ISO 8601 format (yyyy-mm-dd hh:mm:ss).

In [None]:
#Let's convert ObservationDate to datetime object
df["ObservationDate"]=pd.to_datetime(df["ObservationDate"])
df.dtypes
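
As an alternative to converting after loading, the dates can be parsed while reading the file. The sketch below reads a two-row in-memory CSV (invented values) so that it runs without the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content using the same month/day/year style as ObservationDate
csv_text = "ObservationDate,Confirmed\n01/22/2020,1\n01/23/2020,5\n"

raw = pd.read_csv(io.StringIO(csv_text))                                       # dates stay as text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["ObservationDate"])  # dates become datetime

print(raw.dtypes)
print(parsed.dtypes)
print(parsed)
```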

In [None]:
df=df.set_index("ObservationDate")#to make ObservationDate the index

In [None]:
df

In [None]:
print(df.loc["2020-01-22"])#to select rows by date index

# RESAMPLING PANDAS TIME SERIES
* Resampling: applying a statistical method over different time intervals ("M"=month, "A"=year)
* Downsampling: reduce date time rows to a slower frequency, like from daily to weekly
* Upsampling: increase date time rows to a faster frequency, like from daily to hourly
* Interpolate: fill in values according to different methods like "linear", "time" or "index"

Interpolation is a statistical method that uses related known values to estimate an unknown value. It relies on other established values that are located in sequence with the unknown value.

In [None]:
df.resample("M").mean(numeric_only=True)#Resampling by month (numeric_only skips text columns)

In [None]:
df.resample("M").mean(numeric_only=True).interpolate("linear")#to fill gaps in the monthly means by linear interpolation
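
The three terms above can be seen together on a short invented daily series. This is a sketch with made-up numbers; it downsamples to weekly means, upsamples back to daily rows, and fills the new rows by linear interpolation.

```python
import pandas as pd

# Hypothetical daily values for two full weeks (Monday 2020-03-02 to Sunday 2020-03-15)
dates = pd.date_range("2020-03-02", periods=14, freq="D")
daily = pd.Series(range(14), index=dates, dtype="float64")

# Downsampling: daily -> weekly mean (fewer rows, each labeled with the week's Sunday)
weekly = daily.resample("W").mean()

# Upsampling: weekly -> daily creates new rows that start out as NaN
upsampled = weekly.resample("D").asfreq()

# Interpolation: fill the NaN rows along a straight line between the known points
filled = upsampled.interpolate("linear")

print(weekly)
print(filled)
```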

# MANIPULATING DATA FRAMES WITH PANDAS

**INDEXING DATA FRAMES**

* Indexing using square brackets
* Using column attribute and row label
* Using loc accessor
* Selecting only some columns

In [None]:
#to avoid being affected by the code we wrote above, let's start over
data=pd.read_csv("../input/novel-corona-virus-2019-dataset/covid_19_data.csv")
data=data.set_index("SNo")#convert SNo to index
data.head()

In [None]:
#indexing using square brackets
data["Province/State"][3]

In [None]:
#indexing using column attribute and row label
data.Confirmed[2]

In [None]:
#indexing using loc accessor
data.loc[3,["Province/State"]]

In [None]:
#selecting only some columns
data[["Confirmed","Province/State"]]

**SLICING DATA FRAME**

* Difference between selecting columns
    * Series and data frames
* Slicing and indexing series
* Reverse slicing
* From something to end

In [None]:
#difference between selecting columns
print(type(data["Confirmed"]))#series
print(type(data[["Confirmed"]]))#data frame

In [None]:
#slicing and indexing
data.loc[1:10,"Province/State":"Confirmed"]

In [None]:
#reverse slicing
data.loc[10:1:-1,"Province/State":"Confirmed"]

In [None]:
#from a given column to the end
data.loc[1:10,"Confirmed":]
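
One detail worth checking on a small frame is that label slices with loc include both end points, unlike ordinary Python slices. The frame below is invented and only mirrors the dataset's column names.

```python
import pandas as pd

# Hypothetical frame indexed by SNo, like the data frame above
toy = pd.DataFrame(
    {"Province/State": list("abcde"), "Confirmed": [1, 2, 3, 4, 5], "Deaths": [0, 0, 1, 1, 2]},
    index=pd.Index([1, 2, 3, 4, 5], name="SNo"),
)

forward = toy.loc[2:4, "Province/State":"Confirmed"]  # rows 2, 3 and 4; both ends included
backward = toy.loc[4:2:-1, "Confirmed":]              # reversed rows, from Confirmed to the last column

print(type(toy["Confirmed"]))    # single brackets give a Series
print(type(toy[["Confirmed"]]))  # double brackets give a DataFrame
print(forward)
print(backward)
```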

**FILTERING DATA FRAMES**

* Creating boolean series
* Combining filters
* Filtering a column based on others

In [None]:
#creating boolean series
boolean=data.Recovered<500
data[boolean]

In [None]:
#combining filters
first_filter=data.Recovered<500
second_filter=data.Confirmed<500
data[first_filter&second_filter]

In [None]:
#filtering a column based on others
data.Recovered[data.Confirmed>10000]#Recovered values when confirmed>10000
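
Besides `&` (and), boolean Series can be combined with `|` (or) and negated with `~` (not). A small sketch on invented numbers; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical counts
toy = pd.DataFrame({
    "Confirmed": [100, 20000, 600, 15000],
    "Recovered": [50, 9000, 700, 100],
})

low_confirmed = toy["Confirmed"] < 500
low_recovered = toy["Recovered"] < 500

both = toy[low_confirmed & low_recovered]        # rows meeting both conditions
either = toy[low_confirmed | low_recovered]      # rows meeting at least one condition
neither = toy[~(low_confirmed | low_recovered)]  # rows meeting none

# Filtering one column based on another
recovered_when_large = toy.loc[toy["Confirmed"] > 10000, "Recovered"]
print(both, either, neither, recovered_when_large, sep="\n\n")
```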

# TRANSFORMING DATA

* Plain python functions
* Lambda function
* Defining column using other columns

In [None]:
#plain python functions
def div(n):
    return n/2
data.Confirmed.apply(div)

In [None]:
#lambda function
data.Confirmed.apply(lambda n:n/2)

In [None]:
#defining columns using other columns
data["Recovered+Death"]=data.Deaths+data.Recovered
data.head()
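
For simple arithmetic, operating on the whole column at once usually gives the same result as apply() and tends to be faster on large frames. A sketch with invented values; the DeathRate column name is an illustrative assumption.

```python
import pandas as pd

# Hypothetical counts
toy = pd.DataFrame({"Confirmed": [10, 20, 30], "Deaths": [1, 2, 3], "Recovered": [4, 5, 6]})

halved_apply = toy["Confirmed"].apply(lambda n: n / 2)  # calls the function once per element
halved_vectorized = toy["Confirmed"] / 2                # one operation on the whole column

# Defining a new column from existing columns (hypothetical column name)
toy["DeathRate"] = toy["Deaths"] / toy["Confirmed"]
print(toy)
```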

# INDEX OBJECTS AND LABELED DATA

In [None]:
print(data.index.name)#to see the name of index

In [None]:
#Overwriting index
data2=data.copy()
data2.index=range(100,100+len(data2))#to change the index range without hard-coding the number of rows
data2.head()#now the index starts from 100

**HIERARCHICAL INDEXING**

In [None]:
data1=data.set_index(["ObservationDate","Deaths"])#ObservationDate is outer,Deaths is inner index
data1.head(50)

In [None]:
data1.loc["01/23/2020",1]#to select by indexes
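
Selecting from a two-level index can be tried on a tiny invented frame. In this sketch the index levels are a date string and a country (an assumption chosen for readability), and the index is sorted first, which generally makes lookups more predictable.

```python
import pandas as pd

# Hypothetical rows; dates are kept as strings, as in the raw CSV
toy = pd.DataFrame({
    "date": ["01/22/2020", "01/22/2020", "01/23/2020"],
    "country": ["China", "Japan", "China"],
    "confirmed": [547, 2, 639],
})

indexed = toy.set_index(["date", "country"]).sort_index()  # date is outer, country is inner

one_day = indexed.loc["01/22/2020"]                            # all rows for one outer label
one_cell = indexed.loc[("01/23/2020", "China"), "confirmed"]  # one (outer, inner) pair

print(one_day)
print(one_cell)
```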

**STACKING AND UNSTACKING DATA FRAME**

* Dealing with multi-level indexes
* level: position of the index level to unstack
* swaplevel: swaps the inner and outer index levels

In [None]:
dic={"treatment":["A","A","B","B"],"gender":["F","M","F","M"],"response":[10,45,5,9],"age":[15,4,72,65]}
df1=pd.DataFrame(dic)
df1

In [None]:
#pivoting
df1.pivot(index="treatment",columns="gender",values="response")

In [None]:
df2=df1.set_index(["treatment","gender"])#to make treatment and gender our indexes
df2

In [None]:
df2.unstack(level=0)#level selects which index level moves to the columns (0=treatment)

In [None]:
df2.unstack(level=1)#gender

In [None]:
#swap inner and outer index
df3=df2.swaplevel(0,1)
df3
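
The cells above only unstack. As a sketch of the opposite direction, stack() moves a column level back into the row index, so unstacking and then stacking the same level should return the original rows (possibly in a different order, and depending on the pandas version a FutureWarning about the stack implementation may be shown).

```python
import pandas as pd

dic = {"treatment": ["A", "A", "B", "B"], "gender": ["F", "M", "F", "M"],
       "response": [10, 45, 5, 9], "age": [15, 4, 72, 65]}
df1 = pd.DataFrame(dic)
df2 = df1.set_index(["treatment", "gender"])

wide = df2.unstack(level="gender")  # gender moves from the row index to the columns
tall = wide.stack(level="gender")   # gender moves back into the row index

# Align row and column order before comparing with the original
round_trip = tall.sort_index()[["response", "age"]]
print(wide)
print(round_trip)
```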

**MELTING DATA FRAMES**

In [None]:
pd.melt(df1,id_vars="treatment",value_vars=["age","response"])

**CATEGORICALS AND GROUPBY**

In [None]:
df1

In [None]:
df1.groupby("treatment").age.mean()#calculate the average age according to treatment
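
groupby() can also compute several statistics for several columns in one call, and a text column with few distinct values can be stored as a categorical. The sketch below rebuilds the same small frame so that it runs on its own.

```python
import pandas as pd

dic = {"treatment": ["A", "A", "B", "B"], "gender": ["F", "M", "F", "M"],
       "response": [10, 45, 5, 9], "age": [15, 4, 72, 65]}
df1 = pd.DataFrame(dic)

# One statistic for one column
mean_age = df1.groupby("treatment")["age"].mean()

# Several statistics for several columns; the result has two-level column labels
summary = df1.groupby("treatment")[["age", "response"]].agg(["mean", "max"])

# Categorical dtype: stores each distinct label once plus integer codes
gender_cat = df1["gender"].astype("category")

print(mean_age)
print(summary)
print(gender_cat.cat.categories)
```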