# Real World Dataset Examples

# Titanic dataset

The seaborn library is a popular data visualization library built on top of matplotlib in Python. It provides a high-level interface for creating informative and visually appealing statistical graphics. One of the most commonly used datasets in seaborn is the Titanic dataset, which contains information about the passengers aboard the Titanic during its ill-fated maiden voyage in 1912.

The Titanic dataset is often used as a beginner's dataset for data exploration and visualization due to its relatively small size and comprehensible variables. It consists of various attributes about the passengers, such as their age, sex, ticket class, survival status, and more. This dataset is interesting to work with because it can offer insights into factors that influenced the passengers' chances of survival.

To work with the Titanic dataset in seaborn, you typically load it as a pandas DataFrame and then use seaborn's functions to create visualizations based on the data. This can help you uncover patterns or trends in the data, such as the relationship between passenger class and survival rates or the distribution of ages among the passengers.

The Titanic dataset consists of several columns that provide information about the passengers aboard the Titanic. Here is an explanation of the column headings commonly found in the Titanic dataset; seaborn's copy uses lowercase names (for example, 'survived' and 'pclass'):

PassengerId: A unique identifier for each passenger (found in other distributions of the dataset; not among the 15 columns of seaborn's copy).\
Survived: Indicates whether a passenger survived or not. This column has binary values: 0 for not survived and 1 for survived.\
Pclass: Represents the passenger class or ticket class. It has three categories: 1 (first class), 2 (second class), or 3 (third class).\
Sex: Indicates the sex of the passenger: male or female.\
Age: The age of the passenger in years. It can contain fractional values, for example for infants under one year or when the age is estimated.\
SibSp: The number of siblings or spouses aboard the Titanic for each passenger.\
Parch: The number of parents or children aboard the Titanic for each passenger.\
Fare: The fare or ticket price paid by the passenger.\
Cabin: The cabin number or cabin letter where the passenger stayed (not among the columns of seaborn's copy, which has Deck instead).\
Embarked: Represents the port of embarkation. It can take three values: C (Cherbourg), Q (Queenstown), or S (Southampton).\
Class: Represents the passenger class in text form: First, Second, or Third.\
Who: Represents the passenger as man, woman, or child.\
Adult_male: Indicates whether the passenger is an adult male (True or False).\
Deck: Represents the deck assigned to the passenger; missing for many passengers.\
Embark_town: Represents the town of embarkation as a string.\
Alive: Survival status in text form: yes or no.\
Alone: Indicates whether the passenger travelled without family aboard (True or False).

These column headings provide information about various aspects of the passengers, such as their demographics (e.g., age, sex), socio-economic status (e.g., passenger class), family relationships (e.g., number of siblings/spouses and parents/children), and ticket details (e.g., fare, cabin, embarkation port).
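
Several of these columns restate others in a friendlier form. The sketch below rebuilds five rows by hand (values copied from the first rows of seaborn's copy of the dataset) and checks three such relationships. The rules used for `class`, `alive`, and `alone` are assumptions inferred from those rows, not a documented definition.

```python
import pandas as pd

# First five rows of the seaborn Titanic data, limited to the columns needed here.
sample = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0],
        "pclass": [3, 1, 3, 1, 3],
        "sibsp": [1, 1, 0, 1, 0],
        "parch": [0, 0, 0, 0, 0],
        "class": ["Third", "First", "Third", "First", "Third"],
        "alive": ["no", "yes", "yes", "yes", "no"],
        "alone": [False, False, True, False, True],
    }
)

# Assumed rule: 'class' is 'pclass' spelled out as text.
class_from_pclass = sample["pclass"].map({1: "First", 2: "Second", 3: "Third"})

# Assumed rule: 'alive' is 'survived' written as yes/no.
alive_from_survived = sample["survived"].map({0: "no", 1: "yes"})

# Assumed rule: a passenger is alone when no siblings, spouses,
# parents, or children are aboard.
alone_from_family = (sample["sibsp"] + sample["parch"]) == 0

class_ok = bool((class_from_pclass == sample["class"]).all())
alive_ok = bool((alive_from_survived == sample["alive"]).all())
alone_ok = bool((alone_from_family == sample["alone"]).all())
print(class_ok, alive_ok, alone_ok)
```

If any check printed False on the full dataset, the assumed rule would need revisiting.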








## Preliminary analysis of the Titanic dataset

In [None]:
import seaborn as sns

# Load the Titanic dataset bundled with seaborn as a pandas DataFrame
df = sns.load_dataset('titanic')
print("Shape", df.shape)
print(df.head(5))
print()
print("Describe for eligible columns")
print(df.describe())
print()
print("Describe for single column - age")
df.describe()["age"]



Shape (891, 15)
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Describe for eligible columns
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.38383

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64
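
The count row is a quick way to spot missing data: 'age' has 714 values against 891 rows. Below is a minimal sketch of the same check on a small made-up Series (the values are hypothetical), followed by the arithmetic for the figures reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

n_rows = len(ages)              # counts every row, missing or not
n_present = ages.count()        # counts non-missing values only
n_missing = ages.isna().sum()   # counts missing values

print(n_rows, n_present, n_missing)

# Same arithmetic with the figures reported above for the Titanic 'age' column.
missing_age = 891 - 714
missing_age_pct = round(missing_age / 891 * 100, 2)
print(missing_age, missing_age_pct)
```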

## Basics: count, sum, min, max, range

In [None]:
# count rows with len (includes rows with missing values)
a = len(df)
print(a)

# The list/str method count(value, start, end) is not meant for a DataFrame;
# use value_counts() to count how often each value occurs in a column.

# sum
c = df['fare'].sum()
print(c)

# min
x = df['age'].min()
print(x)

# max
y = df['age'].max()
print(y)

# range = max - min
z = y - x
print(z)

891
28693.9493
0.42
80.0
79.58
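
The same summary can be collected in one call with `Series.agg`. This is a minimal sketch using only the first five fares shown in the preview; on the real data you would call it on `df['fare']` or `df['age']`.

```python
import pandas as pd

# First five fares from the preview above, for illustration only.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# One call returns a Series indexed by the statistic names.
summary = fares.agg(["count", "sum", "min", "max"])
value_range = summary["max"] - summary["min"]

print(summary)
print("range", round(value_range, 4))
```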


## Averages - measures of central tendency

In [None]:
x = df['age'].mean()
print("mean ",x)

y = df['age'].median()
print("median ",y)

z = df['age'].mode().values

print("mode ",z)

mean  29.69911764705882
median  28.0
mode  [24.]
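
Two details are easy to miss here: `mode()` returns a Series because ties are possible, and the median moves far less than the mean when an extreme value is added. A short sketch on made-up values:

```python
import pandas as pd

# Hypothetical ages with a tie for the most frequent value.
ages = pd.Series([22, 24, 24, 30, 30, 41])
modes = ages.mode().tolist()   # every value that shares the top frequency
print("modes", modes)

# Add one extreme value and compare how far each average moves.
with_extreme = pd.concat([ages, pd.Series([95])], ignore_index=True)
mean_shift = with_extreme.mean() - ages.mean()
median_shift = with_extreme.median() - ages.median()
print("mean shift", round(mean_shift, 2))
print("median shift", round(median_shift, 2))
```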


## Standard Deviations & Variance

In [None]:
x = df['age'].std()
print("standard deviation ",round(x,2))

x = df['age'].var()
print("variance ",round(x,2))

standard deviation  14.53
variance  211.02


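The variance is the square of the standard deviation: 211.02 is the square of the unrounded standard deviation, 14.5265. It is therefore much larger than the standard deviation whenever the standard deviation is well above 1, and it is expressed in squared units (years squared here). Because deviations are squared, values far from the mean and outliers inflate the variance strongly. While the variance provides useful information about the spread, the standard deviation is often preferred in practice because it is easier to interpret and is expressed in the same units as the original data. Note that pandas' std() and var() use the sample formula (dividing by n - 1) by default.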
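
The link between the two measures can be checked directly. The sketch below squares the standard deviation reported by describe() above, then uses a small hypothetical sample to show that pandas defaults to the sample formula (`ddof=1`) while NumPy defaults to the population formula (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Standard deviation of 'age' as reported by describe() above.
age_std = 14.526497
age_var = round(age_std ** 2, 2)
print(age_var)

# Hypothetical sample to compare the two conventions.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = pd.Series(values)

sample_std = s.std()              # pandas default: divide by n - 1
population_std = np.std(values)   # NumPy default: divide by n
print(sample_std, population_std)

# Variance is the squared standard deviation under either convention.
squares_match = bool(np.isclose(s.var(), sample_std ** 2))
print(squares_match)
```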

## Percentiles & Quartiles

In [None]:
# Calculate the percentiles (25th, 50th, and 75th) of the 'age' column
age_percentiles = df['age'].quantile([0.25, 0.50, 0.75])

# Print the result
print("25th Percentile (Q1):", age_percentiles[0.25])
print("50th Percentile (Median or Q2):", age_percentiles[0.50])
print("75th Percentile (Q3):", age_percentiles[0.75])

25th Percentile (Q1): 20.125
50th Percentile (Median or Q2): 28.0
75th Percentile (Q3): 38.0


In the example above, the 25th percentile is 20.125. It means that 25% of the recorded ages are below 20.125 and 75% are above it. Missing ages are ignored by quantile().

Similarly, the 50th percentile is commonly known as the median, and it represents the value that divides the data into two equal halves. In our example, the median is 28.

The 75th percentile represents the value below which 75% of the data falls. In this example, the 75th percentile is 38.
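
A quartile such as 20.125 need not be an age that actually occurs in the data: by default `quantile` interpolates linearly between neighbouring values. A sketch on a tiny hypothetical Series:

```python
import pandas as pd

# Four hypothetical values, already sorted for readability.
s = pd.Series([10.0, 20.0, 30.0, 40.0])

q1 = s.quantile(0.25)   # position 0.25 * (n - 1) = 0.75, between 10 and 20
q2 = s.quantile(0.50)   # same as s.median()
q3 = s.quantile(0.75)   # position 2.25, between 30 and 40

print(q1, q2, q3)

# Share of values at or below the median.
share_below = (s <= q2).mean()
print(share_below)
```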

## Value counts and percentages

In [None]:
# Count and percentage for ALL classes
# Get the category count for the 'class' column
class_count = df['class'].value_counts()

# Get the percentages by dividing the count by the total number of records and multiplying by 100
class_percentages = (class_count / len(df)) * 100

# Print the results for all
print("Category Count:")
print(class_count)
print("\nPercentages:")
print(class_percentages)

print()

# Count and percentage for first class only
# Get the count of passengers in the first class
first_class_count = df['class'].value_counts()['First']

# Calculate the total number of passengers
total_passengers = len(df)

# Calculate the percentage of passengers in the first class
first_class_percentage = (first_class_count / total_passengers) * 100

# Print the results for first class
print("Count of passengers in First Class:", first_class_count)
print("Percentage of passengers in First Class: {:.2f}%".format(first_class_percentage))



Category Count:
Third     491
First     216
Second    184
Name: class, dtype: int64

Percentages:
Third     55.106622
First     24.242424
Second    20.650954
Name: class, dtype: float64

Count of passengers in First Class: 216
Percentage of passengers in First Class: 24.24%
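
`value_counts` can also return proportions directly via `normalize=True`, which avoids dividing by the row count by hand. The sketch rebuilds a 'class' column from the counts printed above (491, 216, 184) so that it runs without loading the dataset.

```python
import pandas as pd

# Rebuild the 'class' column from the counts shown in the output above.
classes = pd.Series(
    ["Third"] * 491 + ["First"] * 216 + ["Second"] * 184, name="class"
)

# normalize=True returns fractions of the total; multiply by 100 for percentages.
class_percentages = classes.value_counts(normalize=True) * 100
print(class_percentages.round(2))

first_class_percentage = class_percentages["First"]
print(f"Percentage of passengers in First Class: {first_class_percentage:.2f}%")
```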


## Z-score outlier detection

In [None]:
from scipy import stats

# Work on the Titanic DataFrame loaded earlier
data = df

# Drop rows with missing values in the 'age' column
data = data.dropna(subset=['age'])

# Calculate z-scores for the 'age' column
z_scores = stats.zscore(data['age'])

# Set a threshold for outliers (z-score greater than 3 or less than -3)
threshold = 3

# Detect outliers based on the absolute z-scores
outliers = data[abs(z_scores) > threshold]

# Print the outliers
print("Outliers:")
print(outliers)



Outliers:
     survived  pclass   sex   age  sibsp  parch    fare embarked  class  who  \
630         1       1  male  80.0      0      0  30.000        S  First  man   
851         0       3  male  74.0      0      0   7.775        S  Third  man   

     adult_male deck  embark_town alive  alone  
630        True    A  Southampton   yes   True  
851        True  NaN  Southampton    no   True  


The Titanic dataset contains missing values in the 'age' column. With its default settings (nan_policy='propagate'), stats.zscore returns NaN for every entry when the input contains a NaN, so no row would pass the threshold test. To avoid this, drop the rows with missing values in the 'age' column using dropna() before calculating z-scores. With a threshold of 3, two passengers are flagged, aged 74 and 80.
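
The effect of a missing value on the z-scores can be seen on a few hypothetical ages. The sketch also recomputes the z-scores by hand to show that `stats.zscore` uses the population standard deviation (`ddof=0`) by default.

```python
import numpy as np
from scipy import stats

# Hypothetical ages with one missing value.
ages = np.array([22.0, 38.0, np.nan, 35.0, 80.0])

# Default nan_policy="propagate": the NaN contaminates the mean and std.
z_with_nan = stats.zscore(ages)
all_nan = bool(np.isnan(z_with_nan).all())
print(all_nan)

# Drop the missing value first, as in the cell above.
clean = ages[~np.isnan(ages)]
z_scipy = stats.zscore(clean)

# Same result by hand: (value - mean) / population standard deviation.
z_manual = (clean - clean.mean()) / clean.std(ddof=0)
same = bool(np.allclose(z_scipy, z_manual))
print(same)
```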

## Correlation coefficient

In [None]:
import seaborn as sns
from scipy.stats import pearsonr

# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')

# Select two numerical columns for the bivariate analysis
column1 = 'age'
column2 = 'fare'

# Drop rows with missing values in the selected columns
titanic_data = titanic_data.dropna(subset=[column1, column2])

# Calculate correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(titanic_data[column1], titanic_data[column2])

print("Correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)

if p_value < 0.05:
    print("The correlation is statistically significant at the 5% level.")
else:
    print("There is no statistically significant correlation.")



Correlation coefficient: 0.0960666917690389
P-value: 0.010216277504447006
The correlation is statistically significant at the 5% level.
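
When only the coefficient is needed, pandas can compute it without SciPy: `Series.corr` uses Pearson's r by default and skips incomplete pairs, though it does not return a p-value. A sketch on hypothetical data, checked against NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
frame = pd.DataFrame(
    {
        "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
        "fare": [7.25, 71.28, 8.46, 53.10, 51.86, 21.08],
    }
)

# Series.corr drops the pair with the missing age automatically.
r_pandas = frame["age"].corr(frame["fare"])

# Equivalent manual route: drop incomplete rows, then use NumPy.
complete = frame.dropna(subset=["age", "fare"])
r_numpy = np.corrcoef(complete["age"], complete["fare"])[0, 1]

print(round(r_pandas, 4), round(r_numpy, 4))
```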


It is possible for the correlation coefficient (r-value) to be low, indicating a weak linear relationship between two variables, while the p-value remains statistically significant. The correlation coefficient measures the strength and direction of the linear relationship between variables, while the p-value evaluates the significance of that relationship.

The p-value represents the probability of obtaining the observed correlation coefficient (or a more extreme value) if there were no true correlation in the population. If the p-value is below a pre-defined significance level (commonly set at 0.05 or 0.01), it suggests that the observed correlation is unlikely to be due to chance alone.

Here's an example scenario to illustrate how this can happen:

Let's say we have two variables, X and Y, and we observe a weak correlation between them:

r-value (correlation coefficient) = 0.10 (weak correlation)\
p-value = 0.02 (significant at the 5% level)

In this example, the correlation coefficient (r-value) of 0.10 indicates a weak linear relationship between X and Y. However, the p-value of 0.02 means that there would be only a 2% chance of observing a correlation at least this strong if there were no true correlation between X and Y in the population.

The low r-value shows that the linear relationship is not particularly strong, while the low p-value shows that it is unlikely to be purely due to random chance. Both can hold at once because the p-value depends on the sample size as well as on r: with many observations (714 passengers with a recorded age in the Titanic example), even a small correlation can be told apart from zero.

It's essential to consider both the correlation coefficient and the p-value when interpreting the relationship between variables. A significant p-value indicates that there is evidence of an association, even if the correlation is weak. However, it's crucial to keep in mind that a statistically significant relationship doesn't necessarily imply a strong or practically meaningful relationship between the variables.
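
The role of the sample size can be made concrete with the standard t-test for Pearson's r, where t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below applies it to the r reported above with n = 714 (the rows with a recorded age), then to the same r with a hypothetical sample of 100. The helper name `pearson_p_value` is made up for this illustration.

```python
import math
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for Pearson's r under the null of no correlation."""
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)

r = 0.0960666917690389   # correlation between age and fare reported above
n = 714                  # rows with a non-missing age

p_full = pearson_p_value(r, n)      # the sample size actually used
p_small = pearson_p_value(r, 100)   # hypothetical smaller sample, same r

print(p_full)
print(p_small)
```

The first value should agree with the p-value printed by pearsonr, while the second is well above 0.05: the same weak correlation would not be significant in a sample of 100.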
