# Data Literacy, Pandas Library & Exploratory Data Analysis (EDA)

In this section we cover how to read, handle, and process data, how to visualize it, and what can be deduced from a dataset after processing. At the end, we apply EDA (Exploratory Data Analysis) to a real dataset.

* [Data Types](#datatypes)
* [Measures of Central Tendency](#measures_of_central_tendency)
    * [Measures of Central Tendency In Practice](#measures_of_central_tendency_in_practice)
* [Measures of Dispersion](#measures_of_dispersion)
    * [Measures of Dispersion In Practice](#measures_of_dispersion_in_practice)
* [Essentials of Pandas](#pandas)
    * [Data Frames and Series](#dfs)
    * [Indexing a Data Frame](#index)
    * [Selecting a Specific Part of a Data Frame (iloc and loc)](#iloc_loc)
    * [Modifying Columns and Adding New Columns](#cols)
    * [Resetting the Data Frame](#resetting)
    * [cut and qcut Methods of Pandas](#cutting)
    * [Deleting Rows and Columns](#del_col)
    * [Missing Values](#missing)
    * [Groupby Method](#groupby)
    * [Pivot Tables in Pandas](#pivot)

<a id="datatypes"></a>
# Data Types

* **Qualitative Variables (Categorical Variables)**
  <br>Qualitative data takes values that fall into categories:
    * Gender
    * Color
    * Breed
    * etc...
* **Quantitative Variables (Numerical Variables)**
  <br>Quantitative data takes numerical values to which arithmetic operators can be applied:
    * Age
    * Income
    * Temperature
    * etc...

_____

**Categorical variables** and **numerical variables** each branch into two further categories:

* **Categorical Variables**
    * Nominal
    <br>Nominal variables have categories without any inherent order, such as:
        * Gender
        * Color
        * etc...
    * Ordinal
    <br>Ordinal variables have categories with a specific order, such as:
        * Grade
        * Educational Status
        * Military Rank
        * etc...
* **Numerical Variables**
    * Interval
    <br>On an interval scale, differences between values are meaningful and can be added or subtracted, but there is no absolute zero, so ratios are not meaningful. Such as:
        * Celsius (C)
        * Fahrenheit (F)
        * etc...
    * Ratio
    <br>On a ratio scale, differences between values are meaningful and there is an absolute zero, so ratios are meaningful as well. Such as:
        * Age
        * Length
        * etc...
_____

In summary, **data** branches into two types:

* **Qualitative Variables (Categorical Variables)**
    * Nominal
    * Ordinal
* **Quantitative Variables (Numerical Variables)**
    * Interval
    * Ratio
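The four scales above can be mirrored in pandas. The following is a minimal sketch, assuming a small hand-made table; the column names and values are hypothetical and only serve as illustration. It shows that an ordinal variable can be stored as an ordered categorical, which makes comparisons meaningful, while a nominal variable stays unordered.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
people = pd.DataFrame({
    "color": ["red", "blue", "red"],                  # nominal: no order
    "education": ["high school", "phd", "bachelor"],  # ordinal: ordered
    "temperature_c": [21.5, 19.0, 23.1],              # interval: no absolute zero
    "age": [34, 51, 27],                              # ratio: absolute zero exists
})

# Nominal variable: an unordered categorical
people["color"] = pd.Categorical(people["color"])

# Ordinal variable: an ordered categorical with an explicit ranking
levels = ["high school", "bachelor", "phd"]
people["education"] = pd.Categorical(people["education"], categories=levels, ordered=True)

# Order-based operations only make sense for the ordered categorical
highest = people["education"].max()
above_high_school = (people["education"] > "high school").sum()

print(people.dtypes)
print(highest, above_high_school)
```

Storing the order in the data type means later steps such as sorting or filtering respect the ranking instead of falling back to alphabetical order.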

<a id="measures_of_central_tendency"></a>
# Measures of Central Tendency

Measures of central tendency are statistical values that describe the center of a data distribution. Their purpose is to summarize a dataset with a single representative value.

To calculate these measures in Python, we rely on libraries such as `numpy`, `pandas`, and `scipy`.

For now, `numpy` and `pandas` are enough.

* **Mean**<br>
The sum of all measurements divided by the number of observations in the data set.
* **Median**<br>
The middle value that separates the higher half from the lower half of the data set.
* **Mode**<br>
The most frequent value in the data set.
* **Quartiles**<br>
Quantiles that divide the sorted data points into four equal parts, or quarters.




<a id="measures_of_central_tendency_in_practice"></a>
# Measures of Central Tendency In Practice

In [1]:
# Required Libraries
import numpy as np
import pandas as pd

In [2]:
# Creating an array of 10 random integers between 0 and 9
random_array = np.random.randint(10, size=10)
random_array

array([9, 1, 9, 9, 8, 5, 1, 7, 6, 6])

In [3]:
# Creating a dataframe to apply measurements
s = pd.DataFrame(random_array)

In [4]:
# Mean Value
s.mean()

Unnamed: 0,0
0,6.1


In [5]:
# Median Value
s.median()

Unnamed: 0,0
0,6.5


In [6]:
# Mode Value
s.mode()

Unnamed: 0,0
0,9


In [7]:
# Quartiles
s.quantile(q=[0.25, 0.50, 0.75])

Unnamed: 0,0
0.25,5.25
0.5,6.5
0.75,8.75


<a id="measures_of_dispersion"></a>
# Measures of Dispersion

Measures of dispersion describe the extent to which a distribution is stretched or squeezed. Common measures of statistical dispersion are the range, variance, standard deviation, and interquartile range.

* **Range**<br>
The range of a data set is the size of the narrowest interval that contains all the data, that is, the maximum minus the minimum.
* **Standard Deviation**<br>
The standard deviation is a measure of the amount of variation of the values of a variable about its mean.
* **Variance**<br>
The expected value of the squared deviation from the mean of a random variable.
* **Interquartile Range (IQR)**<br>
The difference between the third quartile (75th percentile) and the first quartile (25th percentile), that is, the spread of the middle half of the data.

<a id="measures_of_dispersion_in_practice"></a>
# Measures of Dispersion In Practice

In [8]:
# Range (named `value_range` so that Python's built-in `range` is not shadowed)
value_range = s.max() - s.min()
value_range

Unnamed: 0,0
0,8


In [9]:
# Standard Deviation
s.std()

Unnamed: 0,0
0,3.034981


In [10]:
# Variance
s.var()

Unnamed: 0,0
0,9.211111


In [11]:
# Quartiles
quartiles = s.quantile(q=[0.25, 0.50, 0.75])
# IQR
IQR = quartiles.iloc[2] - quartiles.iloc[0]
IQR

Unnamed: 0,0
0,3.5
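One detail behind the numbers above is worth a short sketch: pandas computes the sample variance and standard deviation (dividing by n - 1), while NumPy divides by n by default. The example below assumes the same ten values as the random array shown earlier and compares both conventions; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the random array printed earlier in this section
values = np.array([9, 1, 9, 9, 8, 5, 1, 7, 6, 6])
series = pd.Series(values)

sample_var = series.var()        # pandas default: ddof=1, divides by n - 1
population_var = np.var(values)  # NumPy default: ddof=0, divides by n

print(round(sample_var, 6))      # 9.211111
print(round(population_var, 6))  # 8.29

# Passing ddof=1 to NumPy reproduces the pandas result
matching_var = np.var(values, ddof=1)
```

When results from the two libraries are compared, it helps to set `ddof` explicitly so both use the same convention.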


<a id="pandas"></a>
# Essentials of Pandas Library

In [12]:
# Importing required libraries
import numpy as np
import pandas as pd
import seaborn as sns # to import a dataset and visualize the dataset
import matplotlib.pyplot as plt # to visualize the dataset

<a id="dfs"></a>
## Data Frames and Series

To work with pandas, we first need a dataframe or a series. Both can be created by the user or loaded from an existing dataset.

The main difference between dataframes and series is their number of dimensions: a series has one dimension, while a dataframe has two. This distinction becomes clearer as the section progresses.

### Creating Series

In [13]:
# Creating a Series with a List object
s = pd.Series(data=[1, 2, 3, 4, 5, 6, 7, 8, 9])
s

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9


In [14]:
array = np.random.randint(100, size=(10))

# Creating a Series from a numpy array
s = pd.Series(array)
s

Unnamed: 0,0
0,55
1,92
2,39
3,63
4,80
5,66
6,51
7,69
8,5
9,89


In [15]:
dict_a = {"A": 1, "B": 2, "C": 3, "D": 4}

# Creating a series with a dictionary
s = pd.Series(dict_a)
s

Unnamed: 0,0
A,1
B,2
C,3
D,4


In [16]:
list_index = ["A", "B", "C", "D"]
list_data = [1, 2, 3, 4]

# Creating a series with two separate lists
s = pd.Series(data=list_data, index=list_index)
s

Unnamed: 0,0
A,1
B,2
C,3
D,4


### Creating Dataframes

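A dataframe can easily be created by reading a file (.xlsx, .csv, .json, .pickle, .txt) or the result of a SQL query, as long as the **correct path** (or, for SQL, an open database connection) is given:

    df = pd.read_excel("data.xlsx")
    df = pd.read_csv("data.csv")
    df = pd.read_json("data.json")
    df = pd.read_pickle("data.pickle")
    df = pd.read_csv("data.txt")  # pass `sep` if the file is not comma-separated
    # read_sql takes a query and a connection, not a file path;
    # `my_table` and `connection` are placeholders
    df = pd.read_sql("SELECT * FROM my_table", connection)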
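
Since a dataframe can also be created by the user, here is a minimal sketch of two ways to do so without any file on disk: building it from a dictionary, and parsing CSV text from an in-memory buffer. The names and values are hypothetical.

```python
import io
import pandas as pd

# Building a data frame by hand from a dictionary of columns
df_manual = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 28]})

# Reading CSV text from an in-memory buffer instead of a file path
csv_text = "name,age\nAda,36\nLinus,28\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_manual)
print(df_from_csv)
```

The in-memory buffer is handy for trying out `read_csv` options such as `sep` before pointing the call at a real file.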

#### Creating Dataframe with Seaborn

In [17]:
# Importing a dataset from seaborn library
titanic_dataset = sns.load_dataset("titanic")

# Working on a copy keeps the original dataset untouched
df = titanic_dataset.copy()

# `head()` shows the first five rows of `df`
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


|  *Features*=>  |   survived |   pclass | sex    |   age |   sibsp |   parch |    fare | embarked   | class   | who   | adult_male   | deck   | embark_town   | alive   | alone   |
|---:|-----------:|---------:|:-------|------:|--------:|--------:|--------:|:-----------|:--------|:------|:-------------|:-------|:--------------|:--------|:--------|
|  Results of one sample=> |          0 |        3 | male   |    22 |       1 |       0 |  7.25   | S          | Third   | man   | True         | nan    | Southampton   | no      | False   |
|  1 |          1 |        1 | female |    38 |       1 |       0 | 71.2833 | C          | First   | woman | False        | C      | Cherbourg     | yes     | False   |
|  2 |          1 |        3 | female |    26 |       0 |       0 |  7.925  | S          | Third   | woman | False        | nan    | Southampton   | yes     | True    |
|  3 |          1 |        1 | female |    35 |       1 |       0 | 53.1    | S          | First   | woman | False        | C      | Southampton   | yes     | False   |
|  4 |          0 |        3 | male   |    35 |       0 |       0 |  8.05   | S          | Third   | man   | True         | nan    | Southampton   | no      | True    |

<a id="index"></a>
## Indexing a Data Frame

### Choosing Single Feature

In [18]:
# choosing 'age' as series from df
df["age"]

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,
889,26.0


In [19]:
# another method to choose 'age' as series from df
df.age

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,
889,26.0


In [20]:
# choosing 'age' as a dataframe from df
df[["age"]]

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,
889,26.0


**Important Note:**

As shown above, a feature can be selected in several ways. Single brackets, `df["age"]`, return a series, while double brackets, `df[["age"]]`, return a dataframe. To select multiple features, pass a list of names inside the brackets, as in `df[["age", "sex"]]`; the result is always a dataframe.

In [21]:
# Choosing as series
print(type(df["age"]))

# Choosing as a data frame
print(type(df[["age"]]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


### Choosing Multiple Features

In [22]:
df[["age", "sex"]]

Unnamed: 0,age,sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
...,...,...
886,27.0,male
887,19.0,female
888,,female
889,26.0,male


### Fancy Indexing

Above, we wrote a list directly inside the brackets of `df` to select multiple features. The same result is obtained by passing a list that is defined beforehand.

In [23]:
# a list of the columns we want to select
columns = ["survived", "sex", "age"]

# fancy indexing
df[columns]

Unnamed: 0,survived,sex,age
0,0,male,22.0
1,1,female,38.0
2,1,female,26.0
3,1,female,35.0
4,0,male,35.0
...,...,...,...
886,0,male,27.0
887,1,female,19.0
888,0,female,
889,1,male,26.0


<a id="iloc_loc"></a>
## Selecting Specific Parts of a Data Frame via **.iloc** and **.loc**

Specific parts of a data frame can be selected with the pandas indexers **iloc** and **loc**. Both work much like Python's built-in slicing.

**iloc:**

**iloc** selects rows and columns by integer position.

**loc:**

**loc** selects rows and columns by label. It also accepts boolean conditions, in which case only the rows that meet the condition are returned.
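The difference between positions and labels is easiest to see when the index is not 0, 1, 2, and so on. The following is a small sketch, assuming a hypothetical frame with index labels 10 to 40; note that an `iloc` slice excludes its end position while a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
scores = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cem", "Dee"], "score": [88, 92, 75, 60]},
    index=[10, 20, 30, 40],
)

by_position = scores.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = scores.loc[10:30]    # labels 10 through 30; the end label is included
by_condition = scores.loc[scores["score"] > 80, "name"]  # boolean mask plus a column label

print(len(by_position), len(by_label), list(by_condition))
```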

### Index Based Selection (iloc Method)

It uses integer positions to slice the data frame.

In [24]:
# Let's check our data frame before using iloc method
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [25]:
df.iloc[0, 2] # row at position 0 (first row), column at position 2 (third column)

'male'

In [26]:
df.iloc[:, 0] # : => all rows, 0 => first column

Unnamed: 0,survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,1
888,0
889,1


In [27]:
df.iloc[0, :] # 0 => first row, : => all columns

Unnamed: 0,0
survived,0
pclass,3
sex,male
age,22.0
sibsp,1
parch,0
fare,7.25
embarked,S
class,Third
who,man


In [28]:
df.iloc[:, :] # all rows, all columns

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [29]:
df.iloc[::-1] # all rows and all columns, in reversed row order

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
890,0,3,male,32.0,0,0,7.7500,Q,Third,man,True,,Queenstown,no,True
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False


In [30]:
df.iloc[::2] # every second row starting from position 0, all columns

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
884,0,3,male,25.0,0,0,7.0500,S,Third,man,True,,Southampton,no,True
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False


In [31]:
df.iloc[0:3, 0:3] # first three rows and first three columns

Unnamed: 0,survived,pclass,sex
0,0,3,male
1,1,1,female
2,1,3,female


### Label Based Selection (loc method)

It uses labels (index values and feature names) and conditions to slice a data frame.

In [32]:
df.loc[:, "age"] # all rows of `age` column

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,
889,26.0


In [33]:
df.loc[df["age"] > 40] # all rows where `age` is greater than 40, with all columns

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
15,1,2,female,55.0,0,0,16.0000,S,Second,woman,False,,Southampton,yes,True
33,0,2,male,66.0,0,0,10.5000,S,Second,man,True,,Southampton,no,True
35,0,1,male,42.0,1,0,52.0000,S,First,man,True,,Southampton,no,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
862,1,1,female,48.0,0,0,25.9292,S,First,woman,False,D,Southampton,yes,True
865,1,2,female,42.0,0,0,13.0000,S,Second,woman,False,,Southampton,yes,True
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
873,0,3,male,47.0,0,0,9.0000,S,Third,man,True,,Southampton,no,True


In [34]:
df.loc[df["age"] > 40, ["survived", "sex", "fare"]] # survived, sex, and fare columns where the `age > 40` condition is met

Unnamed: 0,survived,sex,fare
6,0,male,51.8625
11,1,female,26.5500
15,1,female,16.0000
33,0,male,10.5000
35,0,male,52.0000
...,...,...,...
862,1,female,25.9292
865,1,female,13.0000
871,1,female,52.5542
873,0,male,9.0000


In [35]:
# `&`, `|` operators can be used to apply multiple conditions in selection

df.loc[(df["age"] > 40)
        & (df["sex"]=="male")
        & ((df["embark_town"]=="Southampton") | (df["embark_town"]=="Cherbourg")),
        ["survived", "pclass", "age", "fare"]]
# `survived`, `pclass`, `age`, and `fare` columns for the rows where age is greater than 40 AND sex is male AND embark town is Southampton OR Cherbourg

Unnamed: 0,survived,pclass,age,fare
6,0,1,54.0,51.8625
33,0,2,66.0,10.5000
35,0,1,42.0,52.0000
54,0,1,65.0,61.9792
62,0,1,45.0,83.4750
...,...,...,...,...
845,0,3,42.0,7.5500
851,0,3,74.0,7.7750
857,1,1,51.0,26.5500
860,0,3,41.0,14.1083


In [36]:
df.loc[(df["age"] > 40)
       & (df["sex"] == "male")
       & ((df["embark_town"] == "Cherbourg") | (df["embark_town"] == "Southampton")),
       ["age", "class", "embark_town"]]

Unnamed: 0,age,class,embark_town
6,54.0,First,Southampton
33,66.0,Second,Southampton
35,42.0,First,Southampton
54,65.0,First,Cherbourg
62,45.0,First,Southampton
...,...,...,...
845,42.0,Third,Southampton
851,74.0,Third,Southampton
857,51.0,First,Southampton
860,41.0,Third,Southampton


In [37]:
df.loc[:, df.columns.str.contains("age")].head()  # selecting the columns that have `age` in their name

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0


In [38]:
df.loc[:, ~df.columns.str.contains("age")].head()  # selecting the columns that DO NOT have `age` in their name

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,0,0,8.05,S,Third,man,True,,Southampton,no,True


<a id="cols"></a>
## Modifying Columns and Adding New Columns

Existing columns can be modified and new ones can be added by assignment:

    df["new_column_name"] = new_values
    df["current_column"] = new_values

Let's create a `new_age` column by subtracting each passenger's age from the current year:

In [39]:
import datetime

# current year as an integer
year = datetime.date.today().year
year

2024

In [40]:
df["new_age"] = year - df["age"]
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,new_age
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,2002.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1986.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,1998.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1989.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1989.0


### Setting a Column as Index

In [41]:
# `set_index` makes a column the index; `inplace=True` applies the change to `df` itself instead of returning a new data frame
df.set_index("new_age", inplace=True)
df.head()

new_age,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
2002.0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1986.0,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
1998.0,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
1989.0,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
1989.0,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [42]:
# we can also change the name of the index
df.index.names = ["new index"]
df.head()

new index,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
2002.0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1986.0,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
1998.0,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
1989.0,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
1989.0,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


<a id="resetting"></a>
## Resetting the Data Frame

In [43]:
# After several in-place changes the data frame may no longer be usable; since `titanic_dataset` still holds the original data, we can restore `df` from a fresh copy
df = titanic_dataset.copy()
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
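Copying the original dataset is the safest reset. When only the index was changed, `reset_index` is a lighter alternative. The sketch below assumes a hypothetical three-row frame standing in for `df`; it also avoids `inplace=True`, so the starting frame is never modified.

```python
import pandas as pd

# Hypothetical small frame standing in for `df`
frame = pd.DataFrame({"passenger": ["a", "b", "c"], "age": [22.0, 38.0, 26.0]})

indexed = frame.set_index("age")  # returns a new frame; `frame` is unchanged
restored = indexed.reset_index()  # moves the index back into an ordinary column

print(list(indexed.columns))
print(list(restored.columns))
```

After `reset_index`, the former index becomes the first column and the rows get a fresh 0, 1, 2 index.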


<a id="cutting"></a>
## cut and qcut Methods of Pandas

pandas provides the `cut` and `qcut` functions to split a numerical column into bins and turn it into a categorical one.

### **cut Method**
**Parameters:**
* **x:** the data to be cut
* **bins:** the number of equal-width bins, or a list of bin edges
* **labels:** the labels assigned to the bins (the intervals themselves are used if not given)
* **right:** whether each bin includes its right edge (default is `True`)
* **include_lowest:** whether the first bin also includes its left edge (default is `False`)

____

### **qcut Method**

The main difference from `cut` is that `qcut` takes the bin edges from the quantiles of the data, so each bin holds roughly the same number of observations.

**Parameters:**
* **x:** the data to be cut
* **q:** the number of quantiles, or the quantiles as a list
* **labels:** the labels assigned to the bins (the intervals themselves are used if not given)
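
The edge-handling parameters are easier to grasp on a tiny example. The following is a minimal sketch, assuming a hypothetical series of four ages and three bins; it shows how `right` and `include_lowest` decide which bin a value on an edge falls into, and how `qcut` balances the bin sizes.

```python
import pandas as pd

ages = pd.Series([0, 10, 18, 25])  # hypothetical values sitting on the bin edges
bins = [0, 10, 18, 90]

# Default right=True: bins are (0, 10], (10, 18], (18, 90]; 0 falls outside and becomes NaN
default_cut = pd.cut(ages, bins)

# include_lowest=True closes the first bin on the left, so 0 is kept
with_lowest = pd.cut(ages, bins, include_lowest=True)

# right=False: bins are [0, 10), [10, 18), [18, 90)
left_closed = pd.cut(ages, bins, right=False, labels=["child", "teen", "adult"])

# qcut picks the edges itself so that each bin holds about the same number of rows
halves = pd.qcut(pd.Series([1, 2, 3, 4, 5, 6]), q=2, labels=["low", "high"])

print(default_cut.isna().sum(), with_lowest.isna().sum())
print(list(left_closed))
print(list(halves))
```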


In [44]:
# cut method in practice
pd.cut(df["age"], [0, 10, 18, 25, 40, 90], labels=["child", "adolescence", "young-adult", "adult", "senior"])

Unnamed: 0,age
0,young-adult
1,adult
2,adult
3,adult
4,adult
...,...
886,adult
887,young-adult
888,
889,adult


In [45]:
# qcut method in practice
pd.qcut(df["age"], q=[0, 0.20, 0.40, 0.60, 0.80, 1], labels=["child", "adolescence", "young-adult", "adult", "senior"])

Unnamed: 0,age
0,adolescence
1,adult
2,young-adult
3,adult
4,adult
...,...
886,young-adult
887,child
888,
889,young-adult


<a id="del_col"></a>
## Deleting Rows and Columns
The essential parameter in this process is `axis`:
* `axis=0` means rows
* `axis=1` means columns

### Deleting Columns

In [46]:
df.drop("fare", axis=1) # returns a copy without the `fare` column; `df` itself is unchanged

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,C,First,man,True,C,Cherbourg,yes,True


In [47]:
df.drop("fare", axis=1, inplace=True) # removes the `fare` column from `df` itself
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,C,First,man,True,C,Cherbourg,yes,True


### Deleting Rows

In [48]:
index_list = [0, 2, 4, 6]
df.drop(index_list, axis=0)  # returns a copy without the rows whose index labels are in `index_list`

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,S,First,woman,False,C,Southampton,yes,False
5,0,3,male,,0,0,Q,Third,man,True,,Queenstown,no,True
7,0,3,male,2.0,3,1,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,S,Third,woman,False,,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,C,First,man,True,C,Cherbourg,yes,True


In [49]:
# resetting the df
df = titanic_dataset.copy()
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


<a id="missing"></a>
## Missing Values

### Calculating the Missing Values

    df.isnull() # returns boolean values; `False` for present data, `True` for missing data
    df.isnull().sum() # returns total count of missing values for each column
    df.isnull().sum().sum() # returns total count of missing values for whole data frame

In [50]:
df.isnull()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False
889,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [51]:
df.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


In [52]:
df.isnull().sum().sum()

869

### Handling Missing Values

#### Deleting Missing Values

    df.dropna()  # Drops every row that has at least one NaN value (default axis is 0)
    df.dropna(axis=1)  # Drops every column that has at least one NaN value
    df.dropna(thresh=2)  # Keeps only the rows that have at least 2 non-NaN values
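
The `thresh` argument is easy to misread, because it counts the values that are present rather than the ones that are missing. Here is a small sketch on a hypothetical three-row frame that compares the common `dropna` variants.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has one NaN, row 2 has two NaN values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})

any_missing = toy.dropna()           # keeps rows without any NaN: row 0
at_least_two = toy.dropna(thresh=2)  # keeps rows with >= 2 non-NaN values: rows 0 and 1
all_missing = toy.dropna(how="all")  # drops only rows that are entirely NaN: none here
complete_cols = toy.dropna(axis=1)   # keeps columns without any NaN: column "c"

print(any_missing.index.tolist(), at_least_two.index.tolist())
print(len(all_missing), list(complete_cols.columns))
```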

#### Filling Missing Values

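    df.fillna(value=1)  # Returns a copy in which missing values are filled with 1

    def mean_val(df, column):
        return int(df[column].mean())  # mean of the column, truncated to an integer

    df["age"].fillna(value=mean_val(df, "age"))  # returns a copy of `age` with missing values filled with the mean; assign it back to keep the result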
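
Because `fillna` returns a new object, the filled values only persist when the result is stored. The sketch below assumes a hypothetical four-row frame; it fills the gaps once with the overall mean and once with the mean of each group, which is a common refinement rather than something this section prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age per group
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "age": [20.0, np.nan, 30.0, np.nan],
})

# Overall mean: every gap receives the same value
overall = toy["age"].fillna(toy["age"].mean())

# Group-wise alternative: each gap receives the mean age of the same sex
per_group = toy["age"].fillna(toy.groupby("sex")["age"].transform("mean"))

print(overall.tolist())
print(per_group.tolist())
print(toy["age"].isna().sum())  # the original column is still unchanged
```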

<a id="groupby"></a>
## Groupby Method

The `groupby` method groups the rows by the values of a selected column, after which aggregation functions can be applied to each group.

### With Single Aggregation Function

#### First Method

In [53]:
df.groupby("sex")["age"].sum() # grouped by sex, then summed the ages in each group

sex,age
female,7286.0
male,13919.17


In [54]:
df.groupby("sex")["age"].mean() # grouped by sex, then took the mean age of each group

sex,age
female,27.915709
male,30.726645


In [55]:
df.groupby("pclass")["fare"].mean() # grouped by passenger class and returned the mean fare for the corresponding class

pclass,fare
1,84.154687
2,20.662183
3,13.67555


#### Second Method

In [56]:
df.groupby("sex").agg({"age": "sum"}) # grouped by sex, then summed the ages in each group

sex,age
female,7286.0
male,13919.17


In [57]:
df.groupby("sex").agg({"age": "mean"}) # grouped by sex, then took the mean age of each group

sex,age
female,27.915709
male,30.726645


In [58]:
df.groupby("pclass").agg({"fare": "mean"}) # grouped by passenger class and returned the mean fare for the corresponding class

pclass,fare
1,84.154687
2,20.662183
3,13.67555


### With Multiple Aggregation Functions

In [59]:
df.groupby("sex").agg({"age": ["sum", "mean", "count"]})
# grouped by sex and returned the sum, mean and count value of the age

,age,age,age
sex,sum,mean,count
female,7286.0,27.915709,261
male,13919.17,30.726645,453


In [60]:
df.groupby("sex").agg({"age": ["mean", "count"],
                       "survived": "mean"})
# grouped by sex and returned the mean and the count value of the age and the mean value of the survived column

,age,age,survived
sex,mean,count,mean
female,27.915709,261,0.742038
male,30.726645,453,0.188908


<a id="pivot"></a>
## Pivot Tables in Pandas

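Pivot tables are similar to the **groupby method**, with one difference: the groups of a second variable are spread across the columns, and the aggregation function defaults to the mean.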

In [61]:
df.pivot_table(values="survived", index="sex", columns="embarked")
# survival rate by sex and port of embarkation

embarked,C,Q,S
sex,,,
female,0.876712,0.75,0.689655
male,0.305263,0.073171,0.174603


In [62]:
df.pivot_table(values="survived", index="sex", columns="embarked", aggfunc="std")
# same as above, but with standard deviation as the aggregation function (default is 'mean')

embarked,C,Q,S
sex,,,
female,0.331042,0.439155,0.463778
male,0.462962,0.263652,0.380058


In [63]:
df.pivot_table("survived", "sex", ["embarked", "class"])
# multiple columns can be passed as a list


embarked,C,C,C,Q,Q,Q,S,S,S
class,First,Second,Third,First,Second,Third,First,Second,Third
sex,,,,,,,,,
female,0.976744,1.0,0.652174,1.0,1.0,0.727273,0.958333,0.910448,0.375
male,0.404762,0.2,0.232558,0.0,0.0,0.076923,0.35443,0.154639,0.128302


# Exploratory Data Analysis (EDA)

## General Information About a Dataset

* Head and tail of the dataset
* Shape and size of the dataset
* Data types of the dataset
* Index pattern of the dataset
* Number of dimensions that the dataset has
* Statistical description of the dataset
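
The checklist above maps onto a handful of pandas attributes and methods. The following sketch bundles them into one helper; the function name `check_df` and the small frame are hypothetical, and in practice the helper would be called on the Titanic `df`.

```python
import numpy as np
import pandas as pd


def check_df(dataframe, n=5):
    """Print a quick overview of a data frame (hypothetical helper)."""
    print("Head:", dataframe.head(n), sep="\n")        # first n rows
    print("Tail:", dataframe.tail(n), sep="\n")        # last n rows
    print("Shape:", dataframe.shape)                   # (rows, columns)
    print("Size:", dataframe.size)                     # rows * columns
    print("Dtypes:", dataframe.dtypes, sep="\n")       # data type of each column
    print("Index:", dataframe.index)                   # index pattern
    print("Dimensions:", dataframe.ndim)               # number of dimensions
    print("Description:", dataframe.describe().T, sep="\n")  # statistics of numeric columns


# Hypothetical small frame for demonstration
toy = pd.DataFrame({"age": [22.0, 38.0, np.nan], "sex": ["male", "female", "female"]})
check_df(toy, n=2)
```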