# Exploratory Data Analysis on Titanic Dataset

**Objectives:** The main aim is to demonstrate the use of Python libraries such as Pandas, Seaborn and Matplotlib for data analysis and visualization.

**Secondary objectives:**
* To load a dataset into a Pandas dataframe.
* To get initial information about the dataset.
* To compute descriptive statistics for the dataset.
* To show how the dataset can be visualized through different types of graphs.



# *Basic Operations on dataset*

1. **Importing Libraries and Loading the dataset**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
df.head()


2. **To print the dimensions (rows, columns) of the dataset**

In [None]:
print(df.shape)

3. **To print names of variables or features**

In [None]:
df.columns

4. **To print the data types of the variables or features (that is, the columns)**

In [None]:
#info() prints its summary itself and returns None, so no print() is needed.
df.info()

5. **To print descriptive statistics of numeric variables.** 

In [None]:
df.describe()

6. **To print descriptive statistics of non-numeric (categorical) variables**

In [None]:
df.describe(include='object')

7. **To print counts of distinct values for categorical variables**

In [None]:
df['Embarked'].value_counts()

8. **To print proportions of distinct values for categorical variables**

In [None]:
df['Sex'].value_counts(normalize=True)
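As a small illustration of the two cells above, the following sketch contrasts absolute counts with normalized shares. The Series `ports` is a hypothetical toy column standing in for df['Embarked'], not the real data, and the `dropna=False` call is an extra shown only to make the treatment of missing values visible.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Embarked']; values are invented.
ports = pd.Series(['S', 'C', 'S', 'Q', 'S', None, 'C', 'S'], name='Embarked')

counts = ports.value_counts()                # absolute counts, missing values excluded
shares = ports.value_counts(normalize=True)  # fractions of the non-missing values
with_nan = ports.value_counts(dropna=False)  # also counts the missing value

print(counts)
print((shares * 100).round(1))  # fractions converted to percentages
print(with_nan)
```

Note that normalize=True returns fractions between 0 and 1, so they are multiplied by 100 here to read as percentages.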

# *Querying a Dataset or Dataframe*

1. **Projecting column or a Feature**

There are multiple ways to access a particular column or feature variable in Pandas. The first and simplest way is to index the dataframe with the name of the desired column. For example, df['Name'] returns all the values in the 'Name' column as a Pandas Series. We can then invoke methods such as head(), mean() and max() on the returned Series.

In [None]:
df['Name'].head()

In [None]:
df['Age'].mean()

In [None]:
df['Fare'].max()

2. **Projecting using the loc[] and iloc[] indexers**

The second way of projecting columns and rows is to use the loc[] and iloc[] indexers. loc[] selects by label, so columns are addressed by name and both ends of a slice are included. iloc[] selects by integer position, and the end of a slice is excluded, as in ordinary Python slicing.

In [None]:
#Displays rows with labels 0 to 15 (both included) and the columns from 'Name' to 'Age'.
df.loc[0:15,'Name':'Age']

In [None]:
#Displays rows at positions 0 to 14 and columns at positions 3 to 5 (the slice end is excluded).
df.iloc[0:15,3:6]
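The difference between label-based and position-based slicing is easiest to see on a tiny frame. The sketch below uses a hypothetical dataframe `toy` whose column names mirror the Titanic data but whose values are invented.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the Titanic data, values are invented.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 27.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
})

# loc[]: labels 0, 1, 2 and columns 'Name' through 'Age'; both ends are included.
by_label = toy.loc[0:2, 'Name':'Age']

# iloc[]: positions 0, 1 and columns 0, 1, 2; the end of each slice is excluded.
by_position = toy.iloc[0:2, 0:3]

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 3)
```

The same slice bounds 0:2 therefore return three rows with loc[] but only two with iloc[].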

More complex queries can be constructed to extract the rows that satisfy a particular condition, for example printing the name of the oldest passenger. Here, we first find the largest value of 'Age' and then select the rows where 'Age' equals that maximum.

In [None]:
#Find the oldest person on the ship: select the rows whose 'Age' equals the maximum age.
df[df['Age'] == df['Age'].max()]['Name']
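A boolean mask returns every row that ties for the maximum, whereas idxmax() returns only the first one. The following sketch shows both on a hypothetical frame `toy` with invented names and ages. idxmax() is an alternative offered here for comparison, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical toy rows; names and ages are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Age': [22.0, None, 80.0, 80.0],
})

# Boolean mask: keeps every row tied for the maximum age.
mask = toy['Age'] == toy['Age'].max()
all_oldest = toy.loc[mask, 'Name']

# idxmax(): label of the first row holding the maximum; missing ages are skipped.
first_oldest = toy.loc[toy['Age'].idxmax(), 'Name']

print(all_oldest.tolist())
print(first_oldest)
```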

3. **Sorting**

The method sort_values() sorts a dataframe by the values in one or more columns. Its two most commonly used arguments are 'by', the name (or list of names) of the column(s) to sort on, and 'ascending', which sets the sort order (ascending by default) and may be given as a list with one entry per column.

In [None]:
df.sort_values(by='Age').head()

In [None]:
df.sort_values(by=['Name','Age'],ascending=[True,False]).head()
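To see how a per-column sort order behaves, the sketch below sorts a hypothetical frame `toy` (invented values) by 'Sex' in ascending order and, within each group, by 'Age' in descending order.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'female'],
    'Age': [22.0, 38.0, 4.0, 54.0, 27.0],
})

# First key ascending, second key descending within ties of the first key.
ordered = toy.sort_values(by=['Sex', 'Age'], ascending=[True, False])

print(ordered)
print(list(toy.index))  # the original frame is left untouched
```

sort_values() returns a new dataframe, so the result has to be assigned to a name if it is needed later.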

4. **Replacing Values in a column**

Sometimes, we need to replace all values in a column. Such situations arise, for instance, when we want to convert categorical values into numeric codes. Pandas provides two convenient ways to achieve this task. The first is through the map() method and the second is through the replace() method.

*  **By using the map() method**

The following example replaces the values 'male' and 'female' in the 'Sex' column with the values 0 and 1. Here we define a dictionary with the old values of the column as keys ('male', 'female') and the new values as the corresponding values (0, 1). This dictionary is then passed as an argument to the map() method.

In [None]:
dictionary={'male':0, 'female':1}
df['Sex'] = df['Sex'].map(dictionary)
df.head()

* **By using the replace() method**

As in the previous example, we first define a dictionary that maps old values to new values. This dictionary is then nested inside a second dictionary, with the column name as the key, and the outer dictionary is passed as the argument to the replace() method.

The following example restores the values in the 'Sex' column back to 'male' and 'female' through the replace() method.

In [None]:
oldval = {0: 'male', 1: 'female'}
df = df.replace({'Sex': oldval})
df.head()
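The two methods differ in how they treat values that the dictionary does not cover. The sketch below illustrates this on a hypothetical Series `sex`. The value 'unknown' is invented for the purpose and does not occur in the Titanic 'Sex' column.

```python
import pandas as pd

# Hypothetical toy column; 'unknown' is an invented value the dictionary does not cover.
sex = pd.Series(['male', 'female', 'unknown'], dtype=object)
codes = {'male': 0, 'female': 1}

mapped = sex.map(codes)        # values missing from the dictionary become NaN
replaced = sex.replace(codes)  # values missing from the dictionary are kept as they are

print(mapped.tolist())
print(replaced.tolist())
```

So map() is the stricter choice when every value is expected to be covered, since gaps show up as NaN, while replace() is convenient when only some values should change.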

5. **Groupby**

The method groupby() can be used to group the rows according to the values of a categorical variable and to compute descriptive statistics for each group.

For example, the rows of the Titanic dataset are grouped by 'Survived' (a categorical variable), and descriptive statistics of 'Age' (a numerical variable), including the mean, are calculated separately for passengers who survived and those who did not. The second cell computes the mean age for each port of embarkation.

In [None]:
df.groupby(by='Survived')['Age'].describe()

In [None]:
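#Passing the aggregation by name ('mean') avoids the warning newer Pandas versions raise for np.mean.
df.groupby(by='Embarked')['Age'].agg('mean')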
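Several statistics can be requested in one call through named aggregations. The sketch below is a minimal example on a hypothetical frame `toy` with invented values. The output column names mean_age, n_known_age and n_rows are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are invented, one age is deliberately missing.
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1, 1],
    'Age': [20.0, 40.0, 10.0, None, 30.0],
})

# Named aggregation: new_column=(source_column, aggregation_name).
summary = toy.groupby('Survived').agg(
    mean_age=('Age', 'mean'),      # mean of the non-missing ages
    n_known_age=('Age', 'count'),  # number of non-missing ages
    n_rows=('Age', 'size'),        # number of rows, missing ages included
)

print(summary)
```

Comparing 'count' with 'size' per group is a quick way to spot how many values are missing in each group.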

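Another way of grouping rows is through the crosstab() function, which cross-tabulates the rows by two or more categorical variables.

The following example groups rows by the 'Sex' and 'Survived' columns. With normalize=True, each cell shows the share of all passengers that falls into that combination of gender and survival outcome, so the whole table sums to 1. To obtain the survival proportions within each gender instead, normalize='index' can be passed.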

In [None]:
pd.crosstab(df['Sex'],df['Survived'], normalize=True)
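The effect of the normalize parameter is easiest to check on a few rows. The sketch below uses a hypothetical frame `toy` with invented values and compares the three normalization modes of crosstab().

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})

overall = pd.crosstab(toy['Sex'], toy['Survived'], normalize=True)          # whole table sums to 1
by_sex = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')        # each row sums to 1
by_outcome = pd.crosstab(toy['Sex'], toy['Survived'], normalize='columns')  # each column sums to 1

print(overall)
print(by_sex)
print(by_outcome)
```

In this toy data, two of the three men did not survive, so by_sex shows 2/3 in the ('male', 0) cell while overall shows 2/5 there.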

The method pivot_table() is a more flexible way of grouping rows. It groups rows by one or more categorical variables and presents a descriptive statistic (chosen through 'aggfunc') for several numerical variables at once.

In [None]:
df.pivot_table(values=['Age', 'Fare'], index='Survived', aggfunc='mean')
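As a sketch of that flexibility, the example below builds two pivot tables from a hypothetical frame `toy` (invented values): one with a single grouping variable, and one that spreads a second categorical variable across the columns through the 'columns' parameter.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Survived': [0, 1, 1, 1, 0, 0],
    'Age': [30.0, 20.0, 40.0, 24.0, 50.0, 60.0],
    'Fare': [10.0, 30.0, 80.0, 60.0, 20.0, 40.0],
})

# One row per outcome, one column per numerical variable.
by_outcome = toy.pivot_table(values=['Age', 'Fare'], index='Survived', aggfunc='mean')

# One row per gender, one column per outcome, cells hold the mean fare.
by_sex_outcome = toy.pivot_table(values='Fare', index='Sex', columns='Survived', aggfunc='mean')

print(by_outcome)
print(by_sex_outcome)
```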

# **Data Visualization**

Python provides diverse ways to visualize a dataset through libraries such as Pandas, Matplotlib and Seaborn. The objective of this section is to become familiar with different methods of visualization and to see how different kinds of graphs, such as line charts, bar charts, histograms and scatter plots, are plotted.

1. **Lineplot**

The first and most straightforward way is to depict a dataset through a line graph. The line graph is typically used for numerical and time-series data. Here we use the lineplot() function from Seaborn to plot a graph of age vs fare paid. Both are numeric variables. In the line plot, consecutive points are joined by line segments. When several rows share the same x value, lineplot() by default draws their mean together with a shaded confidence band.

In [None]:
sns.lineplot(x='Age',y='Fare',data=df)
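By default, lineplot() aggregates rows that share the same x value and connects their means. The following sketch reproduces that aggregation step with Pandas alone on a hypothetical frame `toy` (invented values). It approximates the points the line passes through and does not draw anything.

```python
import pandas as pd

# Hypothetical toy frame; several passengers share the same age.
toy = pd.DataFrame({
    'Age': [22, 22, 30, 30, 30, 40],
    'Fare': [10.0, 20.0, 30.0, 60.0, 90.0, 50.0],
})

# Mean fare per distinct age: one point per x value, ordered by age.
line_points = toy.groupby('Age')['Fare'].mean()

print(line_points)
```

Six rows therefore collapse into three points on the line, one for each distinct age.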

2.  **Bar Graph using catplot()**

The second kind of graph that is frequently used is the bar graph. Bar graphs are used to display counts of values for a categorical or discrete variable. Here we display the counts of survived and not survived passengers as a bar graph through the catplot() function from the Seaborn library, which replaces the older factorplot() function.

In [None]:
sns.catplot(x='Survived', data=df, kind='count')

The catplot() function can take an additional categorical variable through the 'hue' parameter. In the following example, we display the genderwise distribution of survived and not survived passengers.

In [None]:
sns.catplot(x='Survived', data=df, kind='count', hue='Sex')

3. **Histogram**

Histograms can be drawn through the hist() method from the Pandas library. The hist() method takes the number of bins as one of its arguments. Histograms are useful for visualizing the distribution of numeric data. In this example, we display the distribution of age through a histogram.

In [None]:
df['Age'].hist(bins=10)
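To make the role of the 'bins' argument concrete, the sketch below computes the counts behind a histogram with NumPy on a hypothetical Series `ages` (invented values). The range between the smallest and largest value is split into equal-width intervals, and each bar's height is the number of values in its interval.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages; one value is deliberately missing.
ages = pd.Series([2.0, 18.0, 22.0, 25.0, 31.0, 40.0, 58.0, 80.0, None])

# Missing values are dropped before counting; 4 bins give 5 equally spaced edges.
counts, edges = np.histogram(ages.dropna(), bins=4)

print(edges)   # interval boundaries from the minimum to the maximum age
print(counts)  # number of ages in each interval
```

With more bins the intervals become narrower and the histogram shows finer detail, at the cost of noisier bar heights.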

4. **KDEplot using FacetGrid()**

Another way to visualize the distribution of numeric data is through a kernel density estimate (KDE) plot, which displays a smoothed estimate of the probability density. The following code segment uses Seaborn's FacetGrid class to draw one KDE curve of 'Age' for each value of 'Sex'.

In [None]:
as_fig = sns.FacetGrid(df,hue='Sex',aspect=5)
#fill=True shades the area under each curve (it supersedes the older shade=True argument).
as_fig.map(sns.kdeplot,'Age',fill=True)
oldest = df['Age'].max()
as_fig.set(xlim=(0,oldest))
as_fig.add_legend()
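What a KDE curve represents can be sketched with NumPy alone: a small Gaussian bump is placed on every observation and the bumps are averaged. The values in `ages` and the bandwidth `h` below are hypothetical choices for illustration. Seaborn selects its bandwidth automatically, so this is not a reproduction of its output.

```python
import numpy as np

# Hypothetical toy ages and a hand-picked bandwidth (assumptions for illustration).
ages = np.array([2.0, 18.0, 22.0, 25.0, 31.0, 40.0, 58.0, 80.0])
h = 8.0

# Evaluate the density on a grid that extends well beyond the data.
grid = np.linspace(-40.0, 120.0, 801)
z = (grid[:, None] - ages[None, :]) / h

# Average of Gaussian bumps centred on each observation.
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(ages) * h * np.sqrt(2 * np.pi))

# The area under a density curve should be close to 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Because each bump extends on both sides of its observation, the curve also covers negative ages, which is why the x-axis is clipped with xlim=(0, oldest) in the cell above.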

5. **Scatter Plot**

The scatter plot is used to observe how one variable relates to another variable. The plot.scatter() method from the Pandas library is used to draw the scatter plot.

The following examples draw two scatter plots. The first is 'Age' vs 'Fare' and the second is 'Age' vs 'Survived'.

In [None]:
df.plot.scatter(x='Age',y='Fare')

In [None]:
df.plot.scatter(x='Age',y='Survived')

6. **Pie Chart using subplots()**

Plotting a pie chart takes a few more steps. Here we use the function subplots() from Matplotlib and draw a pie chart to display the value-wise composition of the 'Survived' variable.

In [None]:
#Count passengers per outcome; sort_index() orders the counts as 0, 1 to match the labels.
sizes = df['Survived'].value_counts().sort_index()
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=['Not Survived', 'Survived'], autopct='%1.1f%%', shadow=True)
plt.show()
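Hard-coded labels only match the slices if they are listed in the same order as the counts. One way to keep the two aligned, sketched below on a hypothetical Series `survived` with invented values, is to derive the labels from the index of the counts. The dictionary `names` is an assumption introduced for this illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy column standing in for df['Survived']; values are invented.
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name='Survived')

# Assumed mapping from the numeric codes to readable slice labels.
names = {0: 'Not Survived', 1: 'Survived'}

sizes = survived.value_counts().sort_index()   # counts ordered by code: 0, then 1
labels = [names[code] for code in sizes.index]  # labels built in the same order

fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%')
ax.set_title('Survival (toy data)')
```

In a notebook the figure is rendered at the end of the cell. When the code runs as a script, plt.show() can be called afterwards to display it.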