In this <b>notebook</b> we ask a few basic <b>questions</b> of a dataset when we first receive it, to get some initial insights.

In [32]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('titanic.csv')

In [5]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### What does the data look like?

To get a preview of the data, you can use a few methods that show how the data is stored in the rows and what the columns are called.

In [9]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [10]:
df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [16]:
"""
Sometimes the rows of a dataset are ordered in a biased way: for example, the first few rows might hold data about
children, the middle rows data about adults, and the last few rows data about old people. In that case head() or
tail() might make you think the data is about one kind of people only. To preview rows from across the whole
dataset at random, use the sample() method.
"""
df.sample(4)        # Gives you 4 random rows from the dataset.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
566,567,0,3,"Stoytcheff, Mr. Ilia",male,19.0,0,0,349205,7.8958,,S
343,344,0,2,"Sedgwick, Mr. Charles Frederick Waddington",male,25.0,0,0,244361,13.0,,S
510,511,1,3,"Daly, Mr. Eugene Patrick",male,29.0,0,0,382651,7.75,,Q
511,512,0,3,"Webber, Mr. James",male,,0,0,SOTON/OQ 3101316,8.05,,S
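
Because sample() picks rows at random, the preview changes on every run. The sketch below shows one way to make it repeatable with the `random_state` parameter, and how `frac` draws a fraction of the rows instead of a fixed count. The small `toy` frame is a hypothetical stand-in for the Titanic data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
})

# random_state fixes the seed, so the same rows come back on every run.
first = toy.sample(4, random_state=42)
second = toy.sample(4, random_state=42)
print(first.equals(second))

# frac draws a fraction of the rows instead of a fixed number.
half = toy.sample(frac=0.5, random_state=0)
print(len(half))
```

A fixed seed is handy when you want to share a notebook and have readers see the same rows you did.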


### How big is the data?

Most datasets you get will be small. But when you get a larger one, it's wise to know how big it is so you can decide how to approach it. The <b>shape</b> attribute returns the number of rows and columns.

In [17]:
df.shape

(891, 12)

### What is the data type of columns?

Each column in your dataset stores a particular type of data. The <b>info()</b> method shows these types up front. If you find columns that use more memory than their values need, you can convert them to smaller types to avoid wasting memory. On large datasets, this kind of optimization can also help your models run faster.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
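
As a rough illustration of the type conversion described above, the sketch below shrinks a small frame with `astype()` and compares memory before and after. The `toy` frame and the chosen target types are assumptions for illustration; check the value ranges of your own columns before downcasting.

```python
import pandas as pd

# Hypothetical frame with Titanic-like columns, repeated to 1000 rows.
toy = pd.DataFrame({
    "Survived": [0, 1] * 500,
    "Pclass": [3, 1] * 500,
    "Sex": ["male", "female"] * 500,
    "Fare": [7.25, 71.2833] * 500,
})

# deep=True also counts the memory held by the strings themselves.
before = toy.memory_usage(deep=True).sum()

slim = toy.astype({
    "Survived": "int8",   # only 0 and 1, so 8 bits are plenty
    "Pclass": "int8",     # values 1 to 3
    "Sex": "category",    # few distinct strings, stored once each
    "Fare": "float32",    # halves the size but loses some precision
})
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

Note that `int8` only holds values from -128 to 127, and `float32` keeps about seven significant digits, so these conversions are not safe for every column.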


### Are there any missing values?

Your dataset might have missing values. If it does, you should deal with them before feeding the data into a Machine Learning model. So it is necessary to find out where values are missing so that you can take the appropriate steps.

The <b>info()</b> method already hints at missing values through its non-null counts. To get the number of missing values in every column directly, you can use this code.

In [26]:
"""
Based on this output, you can decide whether to fill the missing values in a column using some technique
or to discard the columns that have a lot of missing values.
"""

df.isnull().sum()    

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
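
The fill-or-discard decision can be sketched as follows. This is a minimal example on a hypothetical `toy` frame; the 50% cutoff, the median fill, and the most-frequent-value fill are illustrative choices, not rules from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps similar to the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Discard columns that are mostly empty (the 50% cutoff is arbitrary).
kept = toy.loc[:, missing_pct <= 50]

# Fill the remaining gaps: median for numbers, most frequent value for text.
filled = kept.assign(
    Age=kept["Age"].fillna(kept["Age"].median()),
    Embarked=kept["Embarked"].fillna(kept["Embarked"].mode()[0]),
)
print(filled.isnull().sum())
```

Looking at percentages rather than raw counts makes it easier to compare columns when you decide which ones are worth keeping.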

In [28]:
#df.isnull()             #Detects the missing values

### How does the data look mathematically?

<b>describe()</b> method gives you a <b style = "color:orange">high level mathematical summary</b> on <b>numerical</b> columns.

In [29]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [53]:
df[['Age','Fare']].quantile(.05)      #getting the 5th percentile of Age and Fare columns

Age     4.000
Fare    7.225
Name: 0.05, dtype: float64
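
If you need several percentiles at once, quantile() accepts a list, and describe() takes a `percentiles` argument. The sketch below uses a hypothetical `toy` frame rather than the Titanic file.

```python
import pandas as pd

# Hypothetical frame with two numeric columns.
toy = pd.DataFrame({
    "Age": [2.0, 14.0, 22.0, 26.0, 35.0, 38.0, 54.0, 80.0],
    "Fare": [7.25, 7.925, 8.05, 13.0, 30.0, 53.1, 71.2833, 512.3292],
})

# A list of quantiles returns one row per requested quantile.
q = toy.quantile([0.05, 0.5, 0.95])
print(q)

# describe() can report custom percentiles in place of 25% and 75%.
summary = toy.describe(percentiles=[0.05, 0.95])
print(summary)
```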

### Are there any duplicate values?

Duplicate rows can bias your ML model and, if the same row lands in both the training and test sets, make its test results look better than they really are. So you need to find out whether your dataset contains duplicate rows and, if so, remove them using the <b>drop_duplicates()</b> method.

In [56]:
df.duplicated().sum()         #0 means you don't have any duplicate rows in your dataset.

0
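
The Titanic data has no duplicates, so here is a small sketch of the removal step on a hypothetical `toy` frame that does contain one.

```python
import pandas as pd

# Hypothetical frame in which the second passenger appears twice.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["Braund", "Cumings", "Cumings", "Heikkinen"],
})

# Count rows that repeat an earlier row exactly.
n_dupes = toy.duplicated().sum()
print(n_dupes)

# Drop the repeats and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# subset compares only the listed columns; keep="first" retains the first match.
by_id = toy.drop_duplicates(subset=["PassengerId"], keep="first")
print(len(deduped), len(by_id))
```

drop_duplicates() returns a new DataFrame, so assign the result if you want to keep it.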

### How is the correlation between columns?
The <b>corr()</b> method is used to find the pairwise correlation of all columns in the dataframe. Any <b>na</b> values are automatically excluded. <b>Non-numeric</b> columns were silently ignored in older pandas releases; since pandas 2.0 you need to pass <b>numeric_only=True</b>, otherwise corr() raises an error on text columns.

Not all the columns of your dataset may be useful for training your ML model. The corr() method can help you identify which columns are strongly related to the output and which are not. After that you can consider discarding the weakly related columns and training your model with only those columns that help give you the <b>desired</b> result. Keep in mind that the Pearson coefficient only measures linear relationships.

#### <b style = "color:orange">Type of Correlation</b>

- A positive correlation is a relationship between two variables that tend to move in the same direction.
- A negative correlation indicates two variables that tend to move in opposite directions: an increase in one variable tends to be accompanied by a decrease in the other.
- A correlation of zero means there is no linear relationship between the two variables. They can still be related in a non-linear way.
- A correlation coefficient of 1 indicates a perfect positive correlation between two columns, and -1 indicates a perfect negative correlation.
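
To make these cases concrete, the sketch below builds synthetic columns (all names are made up for illustration) and checks their correlation with `x`. The `curve` column depends entirely on `x`, yet its Pearson coefficient is close to zero because the relationship is not linear.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)

# Synthetic columns, one for each type of correlation.
demo = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,            # rises with x
    "down": -3 * x + 5,         # falls as x rises
    "curve": (x - 5.5) ** 2,    # depends on x, but symmetrically around its mean
})

# Correlation of every column with x.
r = demo.corr()["x"]
print(r.round(3))
```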

In [59]:
df.corr(numeric_only=True)    # Pearson correlation coefficients, each between -1 and 1 (numeric_only needs pandas 1.5+)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


In [62]:
"""
We want the correlation of every column with our output column to see which columns matter most.
"""
df.corr(numeric_only=True)['Survived']

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Name: Survived, dtype: float64
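
Since a strong negative correlation is just as informative as a strong positive one, it can help to rank columns by absolute value. The sketch below rebuilds the output above as a Series (the values are copied from it) so that it runs without the CSV file; with the real data you would start from `df.corr(numeric_only=True)['Survived']` instead.

```python
import pandas as pd

# Correlations with Survived, copied from the output above.
corr_with_target = pd.Series({
    "PassengerId": -0.005007,
    "Survived": 1.000000,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Drop the target itself, then sort by strength regardless of sign.
ranked = corr_with_target.drop("Survived").abs().sort_values(ascending=False)
print(ranked)
```

Here Pclass and Fare come out on top, while PassengerId is close to zero, as you would expect from an arbitrary row identifier.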