# Task 2 - Data Summary 
 

### Problem Statement :

- Calculate statistical summary measures (mean, median, mode, std, etc.) on the Titanic dataset to analyze data distribution.

## Solution 

### Statistical Summary - 

- A statistical summary is a concise representation of key statistical measures or descriptors that describe a dataset's central tendencies, variability, and distribution. It typically includes measures such as the mean, median, mode, standard deviation, range, and percentiles, providing a quick overview of the data's characteristics without detailing every individual data point.
    * Mean : The average value calculated by adding up all values and dividing by the total count.
    * Median : The middle value in a dataset when arranged in ascending or descending order; it divides the data into two equal halves.
    * Mode : The value that appears most frequently in a dataset.
    * Standard Deviation : The square root of the variance; it indicates how spread out the values are around the mean, expressed in the same units as the data.
    * Variance : The average squared deviation of the data points from the mean. Pandas divides by n - 1 by default, which gives the sample variance.
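
To make the definitions above concrete, here is a minimal sketch on a small made-up Series (the numbers are hypothetical and do not come from the Titanic dataset). Each measure can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
values = pd.Series([2, 4, 4, 5, 10])

mean = values.mean()          # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
median = values.median()      # middle value of the sorted data = 4.0
mode = values.mode().iloc[0]  # most frequent value = 4
variance = values.var()       # squared deviations sum to 36; 36 / (5 - 1) = 9.0
std = values.std()            # square root of the variance = 3.0

print(mean, median, mode, variance, std)
```

Note that `var()` and `std()` divide by n - 1 by default, so the results are sample statistics.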
    
    
    
### Why is it Important to Perform a Statistical Summary ?  

   * Data Understanding: Provides a quick snapshot of essential characteristics, aiding in understanding the dataset at a glance.
   * Central Tendency: Reveals the typical or central values like mean, median, and mode, showcasing where most data points lie.
   * Variability Assessment: Highlights how spread out or clustered the data is, crucial for assessing consistency or dispersion.
   * Identification of Outliers: Helps in spotting extreme or unusual values that might skew interpretations or analyses.
   * Data Comparison: Enables easy comparison between different datasets or subsets within a dataset.
   * Decision Making: Facilitates informed decision-making by providing insights into trends and patterns within the data.
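
As a small illustration of the outlier point above, the sketch below compares the mean and median of a made-up fare list before and after adding one extreme value. The data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical fares, chosen only to illustrate the effect of one extreme value
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0])
fares_with_outlier = pd.concat([fares, pd.Series([500.0])], ignore_index=True)

# The mean is pulled strongly towards the extreme value
print(fares.mean(), fares_with_outlier.mean())

# The median barely moves, which is why a large mean/median gap hints at outliers
print(fares.median(), fares_with_outlier.median())
```

A large gap between mean and median is therefore a quick signal that a column deserves a closer look.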



### How does a Statistical Summary help in Data Analysis ? 

   * Insights Generation: Summaries reveal patterns, trends, and central tendencies, aiding in forming hypotheses or initial insights.
   * Data Cleaning: Identifies outliers or anomalies, guiding the cleaning process to enhance data quality.
   * Feature Selection: Informs decisions on which features or variables might be most relevant for analysis based on their distributions.
   * Model Building: Helps in selecting appropriate models and validating assumptions by understanding the nature and spread of data.
   * Comparative Analysis: Enables comparison between groups or datasets, supporting hypothesis testing or identifying differences.
   * Visualizations: Guides the creation of visual representations that effectively communicate key aspects of the data to stakeholders.
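
The comparative analysis point can be sketched with `groupby()`. The column names below mirror the Titanic dataset, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the Titanic dataset but the values are made up
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# One summary row per passenger class, with mean and median side by side
by_class = toy.groupby("Pclass")["Fare"].agg(["mean", "median"])
print(by_class)
```

The same pattern should work on the real DataFrame, assuming it has the `Pclass` and `Fare` columns shown later in this notebook.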








* For this task, we are using Pandas to calculate the statistical summaries because it offers built-in functions that quickly compute these measures on structured datasets in Python. 
  

In [1]:
# Importing Pandas Library 

import pandas as pd 

In [2]:
# Loading the Titanic DataSet 

# Since the dataset is in .csv format using Pandas function pd.read_csv() to load the dataset 

df = pd.read_csv('Titanic.csv')

In [3]:
# Displaying the first few rows of the Titanic DataFrame 

df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
# Getting the shape (rows,cols) of the DataFrame

df.shape

(891, 12)

### Different Methods used to calculate Statistical Summary in Pandas - 

##### 1. Individual Calculations : 
   * To calculate all the Statistical Values using Individual functions -
       - mean()   : dataframe['column_name'].mean()
       - median() : dataframe['column_name'].median()
       - mode()   : dataframe['column_name'].mode()      OR         dataframe['column_name'].value_counts().idxmax()
       - std()    : dataframe['column_name'].std()
       - var()    : dataframe['column_name'].var()
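
A minimal sketch of these individual calls on a single column is shown below. The `Age` values are made up, and one missing value is included to show that pandas skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (NaN is skipped by default)
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan]})

age_mean = toy["Age"].mean()                       # (22 + 38 + 26 + 35 + 35) / 5
age_median = toy["Age"].median()                   # middle of the 5 non-missing values
age_mode = toy["Age"].mode()                       # returns a Series, since ties are possible
age_mode_alt = toy["Age"].value_counts().idxmax()  # single most frequent value
age_std = toy["Age"].std()                         # sample standard deviation
age_var = toy["Age"].var()                         # sample variance

print(age_mean, age_median, list(age_mode), age_mode_alt, age_std, age_var)
```

`mode()` returns a Series rather than a single number because a column can have several equally frequent values.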

##### 2. Describe Function : 
   * The describe() function provides a concise summary of descriptive statistics for the numerical columns of a DataFrame: count, mean, standard deviation, minimum, maximum, and the quartile values (25%, 50%, 75%).
       - describe() : dataframe.describe()
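
The sketch below shows `describe()` on a small made-up DataFrame with one text column and one numeric column. It assumes nothing about the real dataset beyond the column names.

```python
import pandas as pd

# Hypothetical toy rows with one text column and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# On a mixed DataFrame, describe() summarises only the numeric columns by default
numeric_summary = toy.describe()
print(numeric_summary)

# On a text column, describe() reports count, unique, top (most frequent) and freq instead
sex_summary = toy["Sex"].describe()
print(sex_summary)
```

The `50%` row of the numeric summary is the median, so `describe()` covers most of the individual calculations in a single call. It does not report the variance or the mode.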


#### Note - mean(), median(), std() and var() are defined for numeric data; they can't be applied to text columns such as strings. mode() and value_counts() also work on categorical data, but this task focuses on the numeric columns.

 First, we select all the columns whose data type is either 'int' or 'float' so that the statistical calculations can be performed.  


In [5]:
# Getting the dataType of all the columns 

df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [6]:
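# Creating a new DataFrame using the include parameter of select_dtypes() to keep only the numeric (int and float) columns
# 'number' is used instead of ['int','float'] because 'int' can map to a 32-bit type on some platforms and miss int64 columns

df_num = df.select_dtypes(include='number')
df_num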

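| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | 27.0 | 0 | 0 | 13.0000 |
| 887 | 888 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 |
| 888 | 889 | 0 | 3 | | 1 | 2 | 23.4500 |
| 889 | 890 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 |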


In [7]:
# We can also use the exclude parameter of select_dtypes() to get the same output 

df_num_ex = df.select_dtypes(exclude='object')
df_num_ex

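| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | 27.0 | 0 | 0 | 13.0000 |
| 887 | 888 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 |
| 888 | 889 | 0 | 3 | | 1 | 2 | 23.4500 |
| 889 | 890 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 |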


## Performing Statistical Calculations on the new DataFrame 


##### Method 1 

In [8]:
# Mean
print('Mean - ')
mean = df_num.mean()
mean

Mean - 


PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

In [9]:
# Median 

print("Median - ")
median = df_num.median()
median

Median - 


PassengerId    446.0000
Survived         0.0000
Pclass           3.0000
Age             28.0000
SibSp            0.0000
Parch            0.0000
Fare            14.4542
dtype: float64

In [10]:
# Mode 

print("Mode - ")
mode = df_num.mode()
mode.head(1)             # mode() returns one row per tied mode; head(1) keeps the first (smallest) mode of each column

Mode - 


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3.0 | 24.0 | 0.0 | 0.0 | 8.05 |


In [11]:
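# Another approach to calculate the mode uses the value_counts() function
# Note: here value_counts() is applied to the mode table itself. Rows of df_num.mode() that contain NaN are dropped,
# so only the first (complete) row remains and idxmax() returns it as a tuple with one mode per column.

print("Mode - ")
mode = df_num.mode().value_counts().idxmax()
mode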

Mode - 


(1, 0.0, 3.0, 24.0, 0.0, 0.0, 8.05)
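
A more direct way to use `value_counts()` is to apply it to each column separately, as in the `dataframe['column_name'].value_counts().idxmax()` form listed earlier. The sketch below does this on a small made-up DataFrame; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the Titanic dataset but the values are made up
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 3, 2],
    "SibSp": [1, 1, 0, 0, 0],
})

# value_counts() tallies one column at a time; idxmax() returns its most frequent value
modes = {col: toy[col].value_counts().idxmax() for col in toy.columns}
print(modes)

# The result agrees with the first row of mode()
first_mode_row = toy.mode().iloc[0]
print(first_mode_row)
```

When a column has tied modes, `idxmax()` returns only one of them, while `mode()` returns all of them.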

In [12]:
# Std 
print("Standard Deviation - ")
std = df_num.std()
std

Standard Deviation - 


PassengerId    257.353842
Survived         0.486592
Pclass           0.836071
Age             14.526497
SibSp            1.102743
Parch            0.806057
Fare            49.693429
dtype: float64

In [13]:
# Var 

print('Variance - ')
var = df_num.var()
var

Variance - 


PassengerId    66231.000000
Survived           0.236772
Pclass             0.699015
Age              211.019125
SibSp              1.216043
Parch              0.649728
Fare            2469.436846
dtype: float64
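
The standard deviation and variance outputs are linked: the variance is the square of the standard deviation (for Age, 14.526497² ≈ 211.019). The sketch below checks this relation on made-up data and shows the `ddof` parameter that switches between sample and population variance.

```python
import pandas as pd

# Hypothetical ages, used only to check the relation between std and var
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])

sample_var = ages.var()            # default ddof=1: divides by n - 1
population_var = ages.var(ddof=0)  # ddof=0: divides by n
sample_std = ages.std()            # default ddof=1, matches sample_var

print(sample_var, population_var, sample_std)
print(abs(sample_std ** 2 - sample_var) < 1e-9)  # std squared equals the variance
```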

##### Method 2 

In [14]:
# Using describe() function 

stats_summ = df.describe()
stats_summ

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
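
The result of `describe()` is itself a DataFrame, so its rows can be reused. The sketch below, on made-up data, derives the interquartile range from the quartile rows and uses `agg()` to add the measures that `describe()` leaves out.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the Titanic dataset but the values are made up
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

summary = toy.describe()

# Interquartile range: distance between the 75% and 25% rows
iqr = summary.loc["75%"] - summary.loc["25%"]
print(iqr)

# describe() has no variance row, so agg() is used to request it alongside the median
extra = toy.agg(["median", "var"])
print(extra)
```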


### A Few Conclusions based on the above Statistical Summary - 


* Survival Rate:

     - Around 38% of passengers survived (the mean of the 0/1 Survived column is 0.38). The median of Survived is 0, which confirms that more than half of the passengers did not survive.


* Passenger Demographics:

     - The average age of passengers was around 29.7 years, with a median age of 28.
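     - At least half of the passengers travelled in third class (median Pclass = 3, which is also the maximum value), indicating a higher representation of the lower class.
     - The fare prices varied significantly, with an average of approximately 32.2 and a median of 14.45. The mean being more than twice the median points to a right-skewed distribution, with a few very expensive tickets (maximum 512.33) pulling the average up.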


* Variability:

     - The standard deviation for age is around 14.5, suggesting considerable variability in passenger ages.
     - Similarly, the fare prices display a high standard deviation of about 49.7, indicating a wide range in ticket prices.
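
Two of the readings above can be sketched in code: the mean of a 0/1 column is the proportion of ones, and a mean well above the median goes together with positive skewness. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the Titanic dataset but the values are made up
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Fare": [7.0, 8.0, 8.0, 9.0, 100.0],
})

# Mean of a binary column = share of rows equal to 1
survival_rate = toy["Survived"].mean()
print(survival_rate)

# A mean above the median and a positive skew() both point to a long right tail
mean_above_median = toy["Fare"].mean() > toy["Fare"].median()
fare_skew = toy["Fare"].skew()
print(mean_above_median, fare_skew)
```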

Therefore, we have successfully calculated Statistical Summary for the given DataSet and made Conclusions.