## Day 24 - DIY Solution

**Q1. Problem Statement: Conditional Probability**

Load the “kerala.csv” data into a DataFrame and perform the following tasks:
1. Explore the DataFrame using the info() and describe() functions
2. June and July are the peak rainfall months. Assume that more than 500 mm of rain in a month raises the chance of a flood. Create a DataFrame with the columns “YEAR”, “JUN_GT_500” (a boolean showing whether it rained more than 500 mm in June), “JUL_GT_500” (a boolean showing whether it rained more than 500 mm in July), and “FLOODS” (a boolean showing whether it flooded that year)
3. Calculate the probability of flood given it rained more than 500 mm in June (P(A|B))
4. Calculate the probability of rain more than 500 mm in June, given it flooded that year (P(B|A))
5. Calculate the probability of flood given it rained more than 500 mm in July (P(A|B))
6. Calculate the probability of rain more than 500 mm in July, given it flooded that year (P(B|A))



**Step-1:** Loading the dataset into a DataFrame.

In [13]:
# Import libraries
import numpy as np
import pandas as pd

# Read the data
df = pd.read_csv("/content/kerala.csv")
df.head()

|   | SUBDIVISION | YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL RAINFALL | FLOODS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KERALA | 1901 | 28.7 | 44.7 | 51.6 | 160.0 | 174.7 | 824.6 | 743.0 | 357.5 | 197.7 | 266.9 | 350.8 | 48.4 | 3248.6 | YES |
| 1 | KERALA | 1902 | 6.7 | 2.6 | 57.3 | 83.9 | 134.5 | 390.9 | 1205.0 | 315.8 | 491.6 | 358.4 | 158.3 | 121.5 | 3326.6 | YES |
| 2 | KERALA | 1903 | 3.2 | 18.6 | 3.1 | 83.6 | 249.7 | 558.6 | 1022.5 | 420.2 | 341.8 | 354.1 | 157.0 | 59.0 | 3271.2 | YES |
| 3 | KERALA | 1904 | 23.7 | 3.0 | 32.2 | 71.5 | 235.7 | 1098.2 | 725.5 | 351.8 | 222.7 | 328.1 | 33.9 | 3.3 | 3129.7 | YES |
| 4 | KERALA | 1905 | 1.2 | 22.3 | 9.4 | 105.9 | 263.3 | 850.2 | 520.5 | 293.6 | 217.2 | 383.5 | 74.4 | 0.2 | 2741.6 | NO |
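
Task 1 asks for an exploration with `info()` and `describe()`. A minimal sketch follows. So that it runs without the CSV file, it rebuilds a small DataFrame from a few columns of the five rows shown above; the name `sample` is an assumption made for illustration. On the full data the same two methods would be called on `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: a few columns of the five rows shown by df.head()
sample = pd.DataFrame({
    "YEAR": [1901, 1902, 1903, 1904, 1905],
    "JUN": [824.6, 390.9, 558.6, 1098.2, 850.2],
    "JUL": [743.0, 1205.0, 1022.5, 725.5, 520.5],
    "FLOODS": ["YES", "YES", "YES", "YES", "NO"],
})

# info() prints column names, non-null counts and dtypes (it returns None)
sample.info()

# describe() summarises the numeric columns: count, mean, std, min, quartiles, max
summary = sample.describe()
print(summary)
```

The text column `FLOODS` is left out of the `describe()` summary by default, which is one reason to convert it to numbers in the next step.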


**Step-2:** Replacing the target column with numeric values (0 and 1).

In [14]:
# Changing the target column to numeric values
df["FLOODS"] = df["FLOODS"].map({"YES": 1, "NO": 0})
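
`Series.map` with a dictionary turns every value that is missing from the dictionary into `NaN`, so it may be worth checking the result before counting. A small sketch on made-up labels; the value `"yes "` with a trailing space is hypothetical and chosen only to show the failure mode.

```python
import pandas as pd

# Hypothetical labels; the third one does not match either dictionary key
labels = pd.Series(["YES", "NO", "yes ", "YES"])

mapped = labels.map({"YES": 1, "NO": 0})
print(mapped)               # the unmatched label becomes NaN
print(mapped.isna().sum())  # number of labels that failed to map

# Normalising whitespace and case before mapping avoids the problem
cleaned = labels.str.strip().str.upper().map({"YES": 1, "NO": 0})
print(cleaned.tolist())
```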

**Step-3:** Creating binary data for the months of June and July using a rainfall threshold of 500 mm.

In [15]:
# Create binary columns for June and July using the 500 mm rainfall threshold
df["JUN_GT_500"] = (df["JUN"] > 500).astype("int")
df["JUL_GT_500"] = (df["JUL"] > 500).astype("int")
# Take an explicit copy so that adding a column does not touch df
df_small = df.loc[:, ["YEAR", "JUN_GT_500", "JUL_GT_500", "FLOODS"]].copy()
df_small["COUNT"] = 1
df_small.head()

|   | YEAR | JUN_GT_500 | JUL_GT_500 | FLOODS | COUNT |
|---|---|---|---|---|---|
| 0 | 1901 | 1 | 1 | 1 | 1 |
| 1 | 1902 | 0 | 1 | 1 | 1 |
| 2 | 1903 | 1 | 1 | 1 | 1 |
| 3 | 1904 | 1 | 1 | 1 | 1 |
| 4 | 1905 | 1 | 1 | 0 | 1 |


**Step-4:** Checking the shape of the new DataFrame.

In [16]:
df_small.shape

(118, 5)
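
The task statement asks for boolean columns, while the notebook stores 0/1 integers. Both work for counting. A sketch of the boolean variant on the five sample rows follows; `sample` and `flags` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows shown by df.head()
sample = pd.DataFrame({
    "YEAR": [1901, 1902, 1903, 1904, 1905],
    "JUN": [824.6, 390.9, 558.6, 1098.2, 850.2],
    "JUL": [743.0, 1205.0, 1022.5, 725.5, 520.5],
    "FLOODS": ["YES", "YES", "YES", "YES", "NO"],
})

# Comparisons return boolean Series, so no astype() call is needed
flags = pd.DataFrame({
    "YEAR": sample["YEAR"],
    "JUN_GT_500": sample["JUN"] > 500,
    "JUL_GT_500": sample["JUL"] > 500,
    "FLOODS": sample["FLOODS"] == "YES",
})
print(flags)

# The mean of a boolean column is the fraction of True values
june_share = flags["JUN_GT_500"].mean()
print(june_share)
```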

**Step-5:** Creating the tabular data based on the counts.

In [17]:
# Creating the tabular data based on the counts
pd.crosstab(df_small["FLOODS"], df_small["JUN_GT_500"])

| FLOODS | JUN_GT_500 = 0 | JUN_GT_500 = 1 |
|---|---|---|
| 0 | 19 | 39 |
| 1 | 6 | 54 |
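
`pd.crosstab` can also add totals or normalise the counts, which gives the probabilities without manual arithmetic. The sketch below is one possible way to do this, not the notebook's method: it rebuilds one row per year from the four counts in the table above (the names `counts`, `years`, `table` and `by_june` are assumptions) and then asks for margins and column-wise proportions.

```python
import pandas as pd

# (FLOODS, JUN_GT_500) -> number of years, copied from the table above
counts = {(0, 0): 19, (0, 1): 39, (1, 0): 6, (1, 1): 54}
rows = [pair for pair, n in counts.items() for _ in range(n)]
years = pd.DataFrame(rows, columns=["FLOODS", "JUN_GT_500"])

# margins=True adds row and column totals labelled "All"
table = pd.crosstab(years["FLOODS"], years["JUN_GT_500"], margins=True)
print(table)

# normalize="columns" divides each column by its total, so the column
# JUN_GT_500 == 1 holds P(FLOODS = 0 | June) and P(FLOODS = 1 | June)
by_june = pd.crosstab(years["FLOODS"], years["JUN_GT_500"], normalize="columns")
print(by_june)
```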


**Step-6:** Defining the variables:

P(F): Probability of flooding

P(J): Probability of having more than 500 mm rain in June

P(F ∩ J): Probability of flooding and having more than 500 mm rain in June

P(F|J): Probability of flooding given it rained more than 500 mm in June

Based on the above table we can easily find these probabilities.

In [18]:
P_F = (6 + 54) / (6 + 54 + 19 + 39)
P_J = (39 + 54) / (6 + 54 + 19 + 39)
P_F_intersect_J = 54 / (6 + 54 + 19 + 39)
print(f"P(Flood): {P_F}") 
print(f"P(June): {P_J}")
print(f"P(Flood AND June): {P_F_intersect_J}")

P(Flood): 0.5084745762711864
P(June): 0.788135593220339
P(Flood AND June): 0.4576271186440678


**Step-7:** Using the formula P(A|B) = P(A ∩ B) / P(B), calculate the conditional probability.

In [19]:
# Now calculate the probability of flood given it rained more than 500 mm in June (P(A|B))
P_F_J = P_F_intersect_J / P_J
print("Probability of flood given it rained more than 500 mm in June (P(A|B)): ")
print(f"P(Flood|June): {P_F_J}")

Probability of flood given it rained more than 500 mm in June (P(A|B)): 
P(Flood|June): 0.5806451612903226
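
The same number can be cross-checked without the formula: a conditional probability is the share of flood years among only those years that satisfy the condition. A sketch under the assumption that the four crosstab counts for June are expanded back into one row per year (`years` and `wet_june` are illustrative names):

```python
import pandas as pd

# (FLOODS, JUN_GT_500) -> number of years, copied from the June crosstab
counts = {(0, 0): 19, (0, 1): 39, (1, 0): 6, (1, 1): 54}
rows = [pair for pair, n in counts.items() for _ in range(n)]
years = pd.DataFrame(rows, columns=["FLOODS", "JUN_GT_500"])

# Keep only the years with more than 500 mm of June rain ...
wet_june = years[years["JUN_GT_500"] == 1]

# ... and take the share of them that flooded: this is P(Flood | June)
p_flood_given_june = wet_june["FLOODS"].mean()
print(len(wet_june), p_flood_given_june)
```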


This raises the reverse question: given that it flooded in Kerala in a given year, what is the probability that it rained more than 500 mm in June (or in July)? This is where Bayes' theorem comes in.

**Step-8:** Probability of rain more than 500 mm in June given it flooded that year (P(B|A)).

In [20]:
# Probability of rain more than 500 mm in June given it flooded that year (P(B|A))
P_J_F = (P_F_J * P_J) / P_F
print("Probability of rain more than 500 mm in June given it flooded that year (P(B|A)): ")
print(f"P(June|Flood): {P_J_F}")

Probability of rain more than 500 mm in June given it flooded that year (P(B|A)): 
P(June|Flood): 0.9000000000000001
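
The trailing `...01` in the output is floating-point rounding, not part of the answer. As a sketch, repeating the calculation with exact fractions from the standard library shows that Bayes' theorem gives exactly 9/10, which is also what the June crosstab gives directly (54 of the 60 flood years had a wet June). The variable names below are illustrative.

```python
from fractions import Fraction

total = 6 + 54 + 19 + 39           # 118 years in the June crosstab
p_f = Fraction(6 + 54, total)      # P(F): flood
p_j = Fraction(39 + 54, total)     # P(J): more than 500 mm in June
p_f_and_j = Fraction(54, total)    # P(F ∩ J)

p_f_given_j = p_f_and_j / p_j            # P(F|J)
p_j_given_f = p_f_given_j * p_j / p_f    # Bayes' theorem: P(J|F)

print(p_f_given_j)                       # 18/31
print(p_j_given_f, float(p_j_given_f))   # 9/10 0.9

# Direct reading from the table: wet-June flood years / all flood years
print(p_j_given_f == Fraction(54, 60))   # True
```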


**Step-9:** Creating the tabular data based on the counts for July.

In [21]:
# We can do the same for July
pd.crosstab(df_small["FLOODS"], df_small["JUL_GT_500"])

| FLOODS | JUL_GT_500 = 0 | JUL_GT_500 = 1 |
|---|---|---|
| 0 | 19 | 39 |
| 1 | 3 | 57 |


**Step-10:** Defining the same quantities for July:

P(F): Probability of flooding 

P(J): Probability of having more than 500 mm rain in July 

P(F ∩ J): Probability of flooding and having more than 500 mm rain in July 

P(F|J): Probability of flooding given it rained more than 500 mm in July

In [22]:
P_F = (3 + 57) / (3 + 57 + 19 + 39)
P_J = (39 + 57) / (3 + 57 + 19 + 39)
P_F_intersect_J = 57 / (3 + 57 + 19 + 39)
print(f"P(Flood): {P_F}") 
print(f"P(July): {P_J}")
print(f"P(Flood AND July): {P_F_intersect_J}")

P(Flood): 0.5084745762711864
P(July): 0.8135593220338984
P(Flood AND July): 0.4830508474576271


**Step-11:** Now calculate the probability of flood given it rained more than 500 mm in July.

In [23]:
# Now calculate the probability of flood given it rained more than 500 mm in July
P_F_J = P_F_intersect_J / P_J
print("Probability of flood given it rained more than 500 mm in July: ")
print(f"P(Flood|July): {P_F_J}")

Probability of flood given it rained more than 500 mm in July: 
P(Flood|July): 0.59375


**Step-12:** Probability of rain more than 500 mm in July given it flooded that year (P(B|A)).

In [24]:
# Probability of rain more than 500 mm in July given it flooded that year (P(B|A))
P_J_F = (P_F_J * P_J) / P_F
print("Probability of rain more than 500 mm in July given it flooded that year (P(B|A)): ")
print(f"P(July|Flood): {P_J_F}")

Probability of rain more than 500 mm in July given it flooded that year (P(B|A)): 
P(July|Flood): 0.9500000000000002


Based on the outputs above, Kerala flooded in about 59% of the years in which July rainfall exceeded 500 mm, and in about 58% of the years in which June rainfall exceeded 500 mm. Both figures are only modestly above the overall flood rate of about 51%, so heavy rainfall in June or July alone does not fully explain the flooding in Kerala.
In the other direction, Bayes' theorem shows that in years when Kerala did flood, June and July rainfall very likely exceeded 500 mm (probabilities of 90% and 95% respectively).
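
The June and July calculations differ only in the four counts taken from each crosstab, so they can be wrapped in one helper. This is a sketch rather than part of the original notebook; the function name `conditional_probabilities` and its parameter names are assumptions.

```python
def conditional_probabilities(no_flood_dry, no_flood_wet, flood_dry, flood_wet):
    """Return (P(flood | wet month), P(wet month | flood)) from crosstab counts."""
    total = no_flood_dry + no_flood_wet + flood_dry + flood_wet
    p_flood = (flood_dry + flood_wet) / total
    p_wet = (no_flood_wet + flood_wet) / total
    p_both = flood_wet / total

    p_flood_given_wet = p_both / p_wet                      # P(A|B) = P(A ∩ B) / P(B)
    p_wet_given_flood = p_flood_given_wet * p_wet / p_flood  # Bayes' theorem
    return p_flood_given_wet, p_wet_given_flood


# Counts read row by row from the two crosstabs above
june = conditional_probabilities(19, 39, 6, 54)
july = conditional_probabilities(19, 39, 3, 57)

for month, (p_flood_given_wet, p_wet_given_flood) in [("June", june), ("July", july)]:
    print(f"{month}: P(Flood|rain > 500 mm) = {p_flood_given_wet:.1%}, "
          f"P(rain > 500 mm|Flood) = {p_wet_given_flood:.1%}")
```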