# 1. Median, Mean, Mode & Percentile

In Machine Learning, median, mean, mode, and percentile are fundamental statistical concepts that are often used in data preprocessing, analysis, and model evaluation. Here’s a breakdown of each:

![image.png](attachment:3dbec1a9-7a25-4a14-8e10-c19224367542.png)

![image.png](attachment:287c974d-b276-47ea-ad9a-f355bfa8ea23.png)

![image.png](attachment:e252f483-61a2-4cb4-975e-e65d35198daa.png)

![image.png](attachment:2657b1fc-b792-47e8-bfb6-9bea919477ff.png)

![image.png](attachment:54665f7c-5c98-408c-a272-3d70b444fffe.png)

In [25]:
# Example 1: Height Dataset

import pandas as pd


# Define the Dataset

df = pd.read_csv(r"C:\Users\admin\Downloads\heights.csv")
print(df)


# Mean

Mean = df.height.mean()
print('\n Mean Height :', Mean)


# Median

Median = df.height.median()
print('\n Median Height :', Median)


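# Percentile (the 90th percentile is used as the upper cut-off)

Percentile = df.height.quantile(0.9)
print('\n Percentile of Height :', Percentile)

# Rows above the cut-off are treated as outliers. This one-sided check
# does not flag unusually small values such as 1.2.
Outlier = df[df.height > Percentile]
print('\n', Outlier)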


       name  height
0     mohan     5.9
1     maria     5.2
2     sakib     5.1
3       tao     5.5
4     virat     4.9
5    khusbu     5.4
6    dmitry     6.2
7    selena     6.5
8      john     7.1
9     imran    14.5
10     jose     6.1
11  deepika     5.6
12   yoseph     1.2
13    binod     5.5

 Mean Height : 6.05

 Median Height : 5.55

 Percentile of Height : 6.920000000000001

     name  height
8   john     7.1
9  imran    14.5
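
Example 1 skips the mode and only checks the upper tail, so the 1.2 entry is never flagged. The sketch below repeats the calculation with Python's standard `statistics` module on the 14 heights printed above and adds a lower cut-off. The 5th and 95th percentile thresholds are an assumption chosen for illustration, not values from the notebook.

```python
import statistics

# Heights copied from the Example 1 output
heights = [5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5, 6.1, 5.6, 1.2, 5.5]

mean_height = statistics.mean(heights)
median_height = statistics.median(heights)
mode_height = statistics.mode(heights)  # most frequent value

# quantiles(n=20) returns the 5th, 10th, ..., 95th percentiles.
# method="inclusive" interpolates linearly between the sorted data points.
cut_points = statistics.quantiles(heights, n=20, method="inclusive")
lower, upper = cut_points[0], cut_points[-1]  # assumed 5% / 95% cut-offs

# Two-sided check: flag values below the lower or above the upper cut-off
outliers = sorted(h for h in heights if h < lower or h > upper)

print("Mean   :", mean_height)
print("Median :", median_height)
print("Mode   :", mode_height)
print("Cut-offs:", lower, upper)
print("Outliers:", outliers)
```

With these cut-offs both the very small and the very large height are reported, while the mean and median agree with the pandas output above.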


In [47]:
# Example 2: Airbnb Dataset

import pandas as pd


# Define the Dataset

df = pd.read_csv(r"C:\Users\admin\Downloads\AB_NYC_2019.csv")
print(df.price.describe())
print(df.shape)

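# Percentile thresholds: 99.9th percentile as the upper cut-off, 1st percentile as the lower cut-off

Maximum = df.price.quantile(0.999)
Minimum = df.price.quantile(0.01)

# Detect the Outlier (rows strictly outside the thresholds)

Outlier = df[(df.price < Minimum) | (df.price > Maximum)]
print('\nOutlier :', Outlier)


# Remove the Outlier (keep rows on or inside the thresholds, the exact complement of the rows above)

df2 = df[(df.price >= Minimum) & (df.price <= Maximum)]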
print(df2.price.describe())
print(df2.shape)



count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
(48895, 16)

Outlier :              id                                             name    host_id  \
957      375249                  Enjoy Staten Island Hospitality    1887999   
1862     826690                 Sunny, Family-Friendly 2 Bedroom    4289240   
2675    1428154              Central, Peaceful Semi-Private Room    5912572   
2698    1448703              Beautiful 1 Bedroom in Nolita/Soho      213266   
2860    1620248    Large furnished 2 bedrooms- - 30 days Minimum    2196224   
...         ...                                              ...        ...   
48486  36280646                     Cable and wfi, L/G included.  272872092   
48647  36354776    Cozy bedroom in diverse neighborhood near JFK  273393150   
48832  36450814                         FLATBUSH HANG OUT AND 

# 2. Standard Deviation and Mean Absolute Deviation

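Standard Deviation (SD) and Mean Absolute Deviation (MAD) are both measures of dispersion, or variability, in a dataset: they quantify how spread out the values are from the center (usually the mean). SD squares each deviation before averaging, so it reacts more strongly to extreme values than MAD does. Note that MAD here means the mean absolute deviation, not the median absolute deviation used by the Modified Z-Score in section 9.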

![image.png](attachment:538a13e2-a543-42b5-9502-c5167d3fedbd.png)

![image.png](attachment:1bcdaa86-a209-4111-94c7-ce89389b6d72.png)

![image.png](attachment:eb3b7646-3357-4d33-8115-3d5d33bbe2a9.png)
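
As a small worked example of the two measures, the sketch below computes the population standard deviation and the mean absolute deviation for the same list. The numbers are hypothetical and were chosen only so that the results are round.

```python
import statistics

# Hypothetical values, chosen so the arithmetic is easy to follow
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Population standard deviation: square root of the average squared deviation
sd = statistics.pstdev(data)

# Mean absolute deviation: average of the absolute deviations from the mean
mad = sum(abs(x - mean) for x in data) / len(data)

print("Mean:", mean)
print("Standard deviation:", sd)
print("Mean absolute deviation:", mad)
```

The standard deviation comes out larger than the mean absolute deviation because squaring gives the far-away values (2 and 9) more weight.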

# 3. Normal Distribution and Z Score

📊 Normal Distribution and 🧮 Z-Score are foundational concepts in statistics and Machine Learning. Understanding them is crucial for data preprocessing, outlier detection, probability estimation, and many ML algorithms.

![image.png](attachment:801898e2-50a4-493f-9c1e-378267077282.png)

![image.png](attachment:b98d7055-c26f-4b7b-ae42-8a59577f9fbd.png)

![image.png](attachment:d632503e-c0b9-41a7-83b1-0ffbb0ebd361.png)

![image.png](attachment:7609833c-e776-4649-a77a-f3366c0d6c53.png)
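
A quick sketch of both ideas using only the standard library: the Z-score formula `z = (x - mean) / std_dev`, and the share of a normal distribution that lies within k standard deviations of the mean. The sample value, mean, and standard deviation passed to `z_score` are hypothetical.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Hypothetical example: a value of 130 when the mean is 100 and the SD is 15
example_z = z_score(130, 100, 15)
print("Z-score:", example_z)

for k in (1, 2, 3, 4):
    print(f"Within {k} SD: {coverage(k):.5f}")
```

The coverage values show why a cut-off of 3 or 4 standard deviations is a common choice for outlier removal: under a normal distribution, very little data lies beyond it.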

In [88]:
# Example: Bangalore Property Prices

import pandas as pd


# Define the Dataset

df = pd.read_csv(r"C:\Users\admin\Downloads\bhp.csv")
print(df.head())
print('\n Shape :',df.shape)

print('\n',df.price_per_sqft.describe())


# Percentile

Minimum = df.price_per_sqft.quantile(0.001)
Maximum = df.price_per_sqft.quantile(0.999)

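# .copy() makes df2 an independent DataFrame, so adding the z-score column later does not raise SettingWithCopyWarning
df2 = df[(df.price_per_sqft > Minimum) & (df.price_per_sqft < Maximum)].copy()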
print('\n Percentile Shape :',df2.shape)
print('\n',df2.price_per_sqft.describe())



# Standard Deviation

Mean = df2.price_per_sqft.mean()
Std_Dev = df2.price_per_sqft.std()

Upper = Mean + 4*Std_Dev
Lower = Mean - 4*Std_Dev

df3 = df2[(df2.price_per_sqft > Lower) & (df2.price_per_sqft < Upper)] 
print('\n Standard Deviation Shape :',df3.shape)
print('\n',df3.price_per_sqft.describe())



# Z-Score (the same filter as the 4-standard-deviation bounds above, expressed in standardized units)

df2['z-score'] = (df2.price_per_sqft - Mean)/Std_Dev

df4 = df2[(df2['z-score'] > -4) & (df2['z-score'] < 4)]
print('\n Z-Score Shape :',df4.shape)

print('\n',df4.price_per_sqft.describe())


                   location       size  total_sqft  bath   price  bhk  \
0  Electronic City Phase II      2 BHK      1056.0   2.0   39.07    2   
1          Chikka Tirupathi  4 Bedroom      2600.0   5.0  120.00    4   
2               Uttarahalli      3 BHK      1440.0   2.0   62.00    3   
3        Lingadheeranahalli      3 BHK      1521.0   3.0   95.00    3   
4                  Kothanur      2 BHK      1200.0   2.0   51.00    2   

   price_per_sqft  
0            3699  
1            4615  
2            4305  
3            6245  
4            4250  

 Shape : (13200, 7)

 count    1.320000e+04
mean     7.920337e+03
std      1.067272e+05
min      2.670000e+02
25%      4.267000e+03
50%      5.438000e+03
75%      7.317000e+03
max      1.200000e+07
Name: price_per_sqft, dtype: float64

 Percentile Shape : (13172, 7)

 count    13172.000000
mean      6663.653735
std       4141.020700
min       1379.000000
25%       4271.000000
50%       5438.000000
75%       7311.000000
max      50349.00



# 4. Logarithm

A logarithm is the inverse operation of exponentiation.
It answers the question:

"To what power must we raise a certain base to get a given number?"

![image.png](attachment:e697f3ea-73f0-46a6-ac89-27ded3f96c90.png)

![image.png](attachment:949f47ec-04d3-4f04-a01a-4ac5ca969811.png)

![image.png](attachment:1e7a4a49-d599-49f0-a495-5997068fd64b.png)

![image.png](attachment:dd901991-30bc-415e-834c-94547360c346.png)

![image.png](attachment:421a5d35-329c-47d2-a8fc-b6281777593c.png)
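
The "inverse of exponentiation" idea can be checked directly with Python's `math` module. This is a minimal sketch; the particular numbers are arbitrary illustrations.

```python
import math

# 2 raised to the power 3 is 8, so log base 2 of 8 is 3
log2_of_8 = math.log2(8)

# 10 raised to the power 3 is 1000, so log base 10 of 1000 is 3
log10_of_1000 = math.log10(1000)

# The natural logarithm uses base e, so ln(e) is 1
ln_of_e = math.log(math.e)

# Round trip: exponentiate, then take the logarithm in the same base
base, exponent = 2, 5
value = base ** exponent
recovered_exponent = math.log(value, base)

print(log2_of_8, log10_of_1000, ln_of_e)
print("Recovered exponent:", recovered_exponent)
```

`math.log(value, base)` works in floating point, so the recovered exponent can differ from the exact integer by a tiny rounding error.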

# 5. Log normal distribution

📈 Log-Normal Distribution (aka lognormal distribution) is a probability distribution of a random variable whose logarithm is normally distributed.

![image.png](attachment:202874f4-1e54-4052-95a3-7e49bd572d1c.png)

🧠 Key Idea:

Take the log of the data → If it forms a normal distribution, then the original data is log-normal.

![image.png](attachment:62639b53-002c-4c03-b9d0-24ec074308f0.png)

![image.png](attachment:0117f929-3536-499b-870d-3a28fb95db68.png)

![image.png](attachment:abd342e6-ff17-46f7-93f0-3cf683f85fab.png)

📌 Why It Matters in ML:

1. Helps in transforming skewed features to resemble normal distributions.

2. Useful for models that tend to work better on roughly symmetric data, such as linear regression, whose residuals are assumed to be approximately normal.

3. Important for log-transformations in preprocessing pipelines.
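
The key idea can be sketched with simulated data: draw log-normal samples, take their logarithm, and compare skewness before and after. The parameters (`mu=0`, `sigma=1`), the sample size, and the seed are assumptions for illustration; `skewness` is a hypothetical helper, not a library function.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the average cubed standardized deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated log-normal data (assumed parameters: mu=0, sigma=1)
samples = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

# The log transform should bring the data close to a normal distribution
log_samples = [math.log(v) for v in samples]

raw_skew = skewness(samples)
log_skew = skewness(log_samples)

print("Skewness of raw data:", raw_skew)
print("Skewness of log data:", log_skew)
```

The raw data should show a strong right skew, while the log-transformed data should have skewness close to zero, which is what a normal distribution has.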

# 6. Sin Cos & Tan

📐 Sine (sin), Cosine (cos), and Tangent (tan) are fundamental trigonometric functions used in mathematics, physics, and computer science — and they also appear in Machine Learning fields like signal processing, computer vision, and geometry-based models.

![image.png](attachment:2da14e09-9dd0-472b-a56d-859cb208ae6f.png)

![image.png](attachment:37f529f1-902f-4f89-97cf-436cbf2d2a45.png)

![image.png](attachment:a97ce8a8-bac6-4b97-9e14-69ddd48ea7c3.png)

![image.png](attachment:ed352d93-7d11-43ee-99de-ba7250913975.png)

![image.png](attachment:05d039e9-49d9-41c7-85f9-2b0059a64aa4.png)
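
As a numeric check of the three ratios, the sketch below uses a right triangle with sides 3, 4, and 5 (an assumed example) and compares the side ratios with the values returned by Python's `math` module.

```python
import math

# Right triangle with assumed side lengths 3 (opposite), 4 (adjacent), 5 (hypotenuse)
opposite, adjacent, hypotenuse = 3.0, 4.0, 5.0

# Angle (in radians) between the adjacent side and the hypotenuse
angle = math.atan2(opposite, adjacent)

sin_value = math.sin(angle)  # opposite / hypotenuse
cos_value = math.cos(angle)  # adjacent / hypotenuse
tan_value = math.tan(angle)  # opposite / adjacent

print("Angle in degrees:", math.degrees(angle))
print("sin:", sin_value, "expected", opposite / hypotenuse)
print("cos:", cos_value, "expected", adjacent / hypotenuse)
print("tan:", tan_value, "expected", opposite / adjacent)
```

Note that `math.sin`, `math.cos`, and `math.tan` expect angles in radians; use `math.radians` to convert from degrees.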

# 7. Cosine similarity & Cosine distance

🔍 Cosine Similarity and Cosine Distance are commonly used metrics in machine learning, especially in text analysis, recommendation systems, and clustering.

![image.png](attachment:0133100f-f978-434f-a11b-4d3f1962c083.png)

![image.png](attachment:ee061289-fd5a-458c-8939-6ff36aa22132.png)

![image.png](attachment:f65240c6-02cf-4711-ae11-1f6efe339fa6.png)

![image.png](attachment:d6027cc9-5b92-47e6-8885-fdf96d7bd142.png)

In [15]:
# Example:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

# Cosine similarity
cos_sim = cosine_similarity(A, B)

# Cosine distance
cos_dist = 1 - cos_sim

print("Cosine Similarity:", cos_sim)
print("Cosine Distance:", cos_dist)


Cosine Similarity: [[0.97463185]]
Cosine Distance: [[0.02536815]]
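
To see what `cosine_similarity` computes, the same result can be reproduced by hand as the dot product divided by the product of the vector lengths. This is a minimal sketch for two plain lists; `cosine_similarity_1d` is a hypothetical helper name, and it assumes neither vector is all zeros.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1, 2, 3]
b = [4, 5, 6]

similarity = cosine_similarity_1d(a, b)
distance = 1 - similarity

print("Cosine Similarity:", similarity)
print("Cosine Distance:", distance)
```

The values agree with the scikit-learn output above, up to floating-point rounding.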


# 8. A/B Testing

🧪 A/B Testing, also known as split testing, is a fundamental method for comparing two versions of something to see which one performs better on a chosen metric.

![image.png](attachment:dbc2723e-c364-451b-a9bd-900849abc43c.png)

![image.png](attachment:615fe6d1-274d-4b2f-ad05-633cfe7a055c.png)

![image.png](attachment:4553de6d-ea16-4261-a400-68ada8fdfde4.png)

![image.png](attachment:80557dec-15c6-4ed7-9b77-ee285b3ef4a5.png)

![image.png](attachment:81a0d9fd-13de-499c-9553-0fece8237df7.png)
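
One common way to analyze an A/B test on conversion rates is a two-proportion z-test. The sketch below is one possible implementation under stated assumptions: the visitor and conversion counts are made up, `two_proportion_z_test` is a hypothetical helper, and the normal approximation assumes reasonably large samples.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 2000 visitors per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)

print("z statistic:", z)
print("p-value:", p_value)
print("Significant at the 5% level:", p_value < 0.05)
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone. It does not say how large or how valuable the improvement is.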

# 9. Modified Z-Score

📊 Modified Z-Score — An Improved Outlier Detection Technique

The Modified Z-Score is a robust version of the Z-score used to detect outliers, especially when your data is not normally distributed or contains extreme values that distort the mean and standard deviation.

✅ Why Use It?

1. The regular Z-score uses mean and standard deviation, which are sensitive to outliers.

2. The modified Z-score uses the median and MAD (Median Absolute Deviation), making it more resilient to extreme values.

![image.png](attachment:2f88f853-11cd-4c77-8bac-8f6fee07b298.png)

![image.png](attachment:d25e91ff-3de3-45fb-85cb-3621e5a00c67.png)

![image.png](attachment:f202d6bf-8ff8-4c1d-8fa4-92abce10a772.png)
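
Before the larger example, here is a self-contained sketch that contrasts the two scores on the 14 heights from section 1. It uses the same constant (0.6745) and thresholds (3 and 3.5) as the cell below, but applies them to the absolute value of each score so that low outliers are caught too; that two-sided check is an assumption of this sketch.

```python
import statistics

# Heights copied from the Example 1 output in section 1
heights = [5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5, 6.1, 5.6, 1.2, 5.5]

# Regular Z-score: based on the mean and the sample standard deviation
mean = statistics.mean(heights)
std_dev = statistics.stdev(heights)
z_outliers = sorted(h for h in heights if abs((h - mean) / std_dev) > 3)

# Modified Z-score: based on the median and the median absolute deviation
median = statistics.median(heights)
mad = statistics.median([abs(h - median) for h in heights])
mz_outliers = sorted(
    h for h in heights if abs(0.6745 * (h - median) / mad) > 3.5
)

print("Z-score outliers          :", z_outliers)
print("Modified Z-score outliers :", mz_outliers)
```

The extreme value 14.5 inflates the standard deviation enough that 1.2 no longer looks unusual to the regular Z-score, while the median-based score flags both values.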

In [52]:
# Example: Detect Outliers in Movie Revenue using Modified Z-score

import pandas as pd
import numpy as np


# Step 1: Define the Dataset

df = pd.read_csv(r"C:\Users\admin\Downloads\movie_revenues.csv")
print('\n Shape :', df.shape)

print('\n', df.revenue.describe())

df['revenue_million'] = df['revenue'] / 1_000_000  # vectorized division; no per-row lambda needed
print('\n', df.revenue_million.describe())



# Step 2: Find the Outlier using Z-score

Mean = df.revenue_million.mean() 
print('\n Mean :', Mean)

Std_Dev = df.revenue_million.std() 
print('\n Standard deviation :', Std_Dev)


df['z_score'] = (df.revenue_million - Mean) / Std_Dev
print('\n z_score :', df['z_score'].head())


Outlier = df[df['z_score'].abs() > 3]  # two-sided: flags unusually low values as well as high ones
print('\n Outlier :', Outlier)



# Step 3: Find the Outlier using Modified Z-score

Median = df.revenue_million.median() 
print('\n Median :', Median)

Difference = (df.revenue_million - Median).abs()
print('\n Difference :', Difference)

# MAD here is the median of the absolute deviations from the median
MAD = Difference.median()
print('\n MAD :', MAD)

df['Modified_z_score'] = 0.6745*(df.revenue_million - Median) / MAD

# Two-sided check with the 3.5 threshold
MZ_Outlier = df[df['Modified_z_score'].abs() > 3.5]
print('\n Modified Z Score :', MZ_Outlier)



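# Step 3 (Alternative): The same calculation wrapped in reusable functions

def get_mad(s):
    """Return the median absolute deviation of a numeric Series or array."""
    median = np.median(s)
    diff = abs(s-median)
    Mad = np.median(diff)
    return Mad

Mad = get_mad(df.revenue_million)
print('\n Mad :', Mad)


def get_modified_z_score(x, Median, MAD):
    """Return the modified Z-score of x for the given median and MAD."""
    return 0.6745*(x-Median)/MAD

# Sanity check on a single value (a revenue of 2711 million)
print('\n Modified Z-score of 2711 :', get_modified_z_score(2711, Median, MAD))


df['mod_z_score'] = df.revenue_million.apply(lambda x: get_modified_z_score(x, Median, MAD))
df.head(3)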




 Shape : (46, 20)

 count    4.600000e+01
mean     1.879289e+08
std      4.551144e+08
min      8.522060e+05
25%      2.866957e+07
50%      8.381714e+07
75%      1.382135e+08
max      2.787965e+09
Name: revenue, dtype: float64

 count      46.000000
mean      187.928898
std       455.114423
min         0.852206
25%        28.669569
50%        83.817142
75%       138.213502
max      2787.965087
Name: revenue_million, dtype: float64

 Mean : 187.92889841304347

 Standard deviation : 455.1144234195408

 z_score : 0    5.712929
1   -0.126336
2   -0.351385
3   -0.411054
4    0.105463
Name: z_score, dtype: float64

 Outlier :       budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                      homepage     id  \
0  http://www.avatarmovie.com/  19995   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   

  ori

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,revenue_million,z_score,Modified_z_score,mod_z_score
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,2787.965087,5.712929,32.339762,32.339762
1,54000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",http://www.youmeanddupree.com/,1819,"[{""id"": 1253, ""name"": ""roommate""}, {""id"": 2038...",en,"You, Me and Dupree",After standing in as best man for his longtime...,18.600367,"[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Two's company. Dupree's a crowd.,"You, Me and Dupree",5.4,407,130.431368,-0.126336,0.557474,0.557474
2,21000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 53, ""name...",,2575,"[{""id"": 246, ""name"": ""dancing""}, {""id"": 470, ""...",en,The Tailor of Panama,A British spy is banished to Panama after havi...,7.047975,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,"In a place this treacherous, what a good spy n...",The Tailor of Panama,6.2,92,28.008462,-0.351385,-0.667434,-0.667434


# 10. Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data.

It helps answer questions like:

“Is the new marketing strategy increasing sales?”

“Does a new drug work better than the old one?”

“Are two machine learning models significantly different?”

✅ Core Idea:

You start with an assumption (the null hypothesis) and use data to test whether there's enough evidence to reject it in favor of an alternative.

![image.png](attachment:78a583cd-bd23-4ce6-89a7-e7c77c880e1a.png)

![image.png](attachment:e583ab9d-c830-49d9-884c-b82033c8221a.png)

![image.png](attachment:039e02ec-19e1-47c2-a02b-8887089c4d55.png)

![image.png](attachment:9a2017cc-0357-4560-99a1-e2afef097bf0.png)
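
The core idea can be sketched with a permutation test, which needs no distributional formulas. The null hypothesis is that the group labels do not matter, so shuffling them should often produce a difference as large as the observed one. The accuracy scores, the helper name `permutation_test`, the number of shuffles, and the seed are all assumptions for illustration.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_b) - statistics.fmean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are interchangeable
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical accuracy scores from repeated runs of two models
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
model_b = [0.86, 0.88, 0.85, 0.87, 0.89, 0.84, 0.86, 0.88]

p_value = permutation_test(model_a, model_b)

print("p-value:", p_value)
print("Reject the null hypothesis at the 5% level:", p_value < 0.05)
```

If the p-value is below the chosen significance level (0.05 here), the null hypothesis is rejected in favor of the alternative; otherwise there is not enough evidence to reject it.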