
# Import the necessary libraries


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
import geopandas as gpd
from shapely.geometry import Point


In [None]:
df = pd.read_excel('/content/Bird_Migration_data.xlsx')

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

# Handling Missing Values

## Imputing missing 'Interrupted_Reason' based on group consistency

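* To facilitate group-based imputation, we first round 'Start_Latitude' and 'Start_Longitude' to one decimal place. This effectively groups locations that fall in the same grid cell of roughly 11 km on a side (0.1° of latitude is about 11 km; the east-west extent shrinks away from the equator).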

* We will group the bird migration data by 'Species', 'Start_Latitude', 'Start_Longitude', and 'Migration_Start_Month'. The goal is to leverage instances where at least one bird within such a specific group has a recorded 'Interrupted_Reason'.

* If we find a reason, we will propagate that same reason to all other birds in the same group with missing 'Interrupted_Reason'.

* The underlying assumption is that birds sharing these characteristics (species, starting point, and migration start time) are more likely to experience the same causes for migration interruption.

* Before attempting the imputation, we will first verify if it's feasible by checking if each group contains at least one non-null 'Interrupted_Reason' value.
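The grouping-and-propagation idea above can be sketched on a small made-up table. The rows and the `Interrupted_Reason_imputed` column are hypothetical; only the column names come from the dataset. Coordinates are rounded to one decimal place, as in the notebook's code cells. This is an illustration under those assumptions, not the result on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names match the dataset.
toy = pd.DataFrame({
    'Species': ['Goose', 'Goose', 'Crane', 'Crane'],
    'Start_Latitude': [10.12, 10.14, 20.50, 20.50],
    'Start_Longitude': [50.11, 50.09, 60.30, 60.30],
    'Migration_Start_Month': ['Mar', 'Mar', 'Jan', 'Jan'],
    'Interrupted_Reason': ['Storm', np.nan, np.nan, np.nan],
})

keys = ['Species', 'Start_Latitude', 'Start_Longitude', 'Migration_Start_Month']

# Round coordinates so nearby start points fall into the same group.
toy[['Start_Latitude', 'Start_Longitude']] = (
    toy[['Start_Latitude', 'Start_Longitude']].round(1)
)

# 'first' returns the first non-null value per group, broadcast to every row.
group_reason = toy.groupby(keys)['Interrupted_Reason'].transform('first')

# Fill only the missing entries; groups with no known reason stay missing.
toy['Interrupted_Reason_imputed'] = toy['Interrupted_Reason'].fillna(group_reason)
print(toy[['Species', 'Interrupted_Reason', 'Interrupted_Reason_imputed']])
```

In this toy table the second Goose row inherits the reason from the first, while both Crane rows stay missing because their group has no recorded reason to propagate.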

In [None]:
missing_reason_df = df[df['Interrupted_Reason'].isnull()]

In [None]:
df_copy = df.copy()

In [None]:
df_copy['Start_Latitude'] = df_copy['Start_Latitude'].round(1)
df_copy['Start_Longitude'] = df_copy['Start_Longitude'].round(1)
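As a rough check on how coarse this rounding is, the size of one grid cell can be estimated from the Earth's circumference. The sketch below assumes a spherical Earth with a mean radius of 6371 km; the helper `cell_size_km` and the example latitude of 45° are hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius

def cell_size_km(decimals, latitude_deg):
    """Approximate (north-south, east-west) size of one rounding cell in km."""
    step_deg = 10 ** (-decimals)                      # 0.1 degrees for one decimal
    km_per_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = step_deg * km_per_deg_lat
    # Meridians converge toward the poles, so longitude steps shrink with cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = cell_size_km(decimals=1, latitude_deg=45.0)
print(f"{ns_km:.1f} km north-south, {ew_km:.1f} km east-west")
```

Rounding to one decimal place therefore gives cells of about 11 km north-south, and somewhat narrower east-west at higher latitudes.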

In [None]:
df_copy['Start_Longitude'].head(10)

In [None]:
len(df_copy['Start_Longitude'].unique())


In [None]:
len(df_copy['Start_Latitude'].unique())

In [None]:
grouping_cols = ['Species', 'Start_Latitude', 'Start_Longitude', 'Migration_Start_Month']
reason_col = 'Interrupted_Reason'

grouped = df_copy.groupby(grouping_cols)

In [None]:
def can_impute_simple(group):
    has_non_null = group[reason_col].notna().any()
    has_null = group[reason_col].isnull().any()
    return has_non_null and has_null

In [None]:
imputation_possible = grouped.apply(can_impute_simple).any()


In [None]:
print(imputation_possible)

No group contains both a recorded and a missing 'Interrupted_Reason', so group-based imputation isn't possible with this grouping.


# Univariate Analysis
## Checking for Outliers and spread


In [None]:
df['Flight_Distance_km'].hist(bins=25)
plt.xlabel('Flight Distance (km)')
plt.ylabel('Frequency')
plt.title('Distribution of Flight Distances')

In [None]:
df['Flight_Duration_hours'].hist(bins=25)
plt.xlabel('Flight Duration (hours)')
plt.ylabel('Frequency')
plt.title('Distribution of Flight Duration')

In [None]:
sns.kdeplot(df['Average_Speed_kmph'])
plt.xlabel('Average Speed (kmph)')
plt.ylabel('Density')
plt.title('Distribution of Average Speed')

In [None]:
altitude_data = df[['Max_Altitude_m', 'Min_Altitude_m']].melt(var_name = 'Altitude_Type', value_name = 'Altitude')

In [None]:
sns.boxplot(x = 'Altitude_Type', y = 'Altitude', data = altitude_data)
plt.xlabel('Altitude Type')
plt.ylabel('Altitude')
plt.title('Distribution of Altitude')

In [None]:
sns.kdeplot(df['Temperature_C'])
plt.xlabel('Temperature (C)')
plt.ylabel('Density')
plt.title('Distribution of Temperature')

In [None]:
sns.kdeplot(df['Wind_Speed_kmph'])
plt.xlabel('Wind Speed (kmph)')
plt.ylabel('Density')
plt.title('Distribution of Wind Speed')

In [None]:
sns.kdeplot(df['Humidity_%'])
plt.xlabel('Humidity (%)')
plt.ylabel('Density')
plt.title('Distribution of Humidity')

In [None]:
sns.kdeplot(df['Pressure_hPa'])
plt.xlabel('Pressure (hPa)')
plt.ylabel('Density')
plt.title('Distribution of Pressure')

In [None]:
sns.kdeplot(df['Visibility_km'])
plt.xlabel('Visibility (km)')
plt.ylabel('Density')
plt.title('Distribution of Visibility')

In [None]:
sns.countplot(x = 'Species', data = df)
plt.xlabel('Species')
plt.ylabel('Frequency')
plt.title('Distribution of Species')

In [None]:
sns.countplot(x='Region', data = df)
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.title('Distribution of Region')

In [None]:
sns.countplot(x = 'Habitat', data = df)
plt.xlabel('Habitat')
plt.ylabel('Frequency')
plt.title('Distribution of Habitat')

In [None]:
sns.countplot(x = 'Weather_Condition', data = df)
plt.xlabel('Weather Condition')
plt.ylabel('Frequency')
plt.title('Distribution of Weather Condition')

In [None]:
sns.countplot(x = 'Migration_Reason', data = df)
plt.xlabel('Migration Reason')
plt.ylabel('Frequency')
plt.title('Distribution of Migration Reason')

In [None]:
df.columns

In [None]:
sns.countplot(x = 'Migration_End_Month', data = df)
plt.xlabel('Migration End Month')
plt.ylabel('Frequency')
plt.title('Distribution of Migration End Month')

In [None]:
sns.countplot(x = 'Migration_Start_Month', data = df)
plt.xlabel('Migration Start Month')
plt.ylabel('Frequency')
plt.title('Distribution of Migration Start Month')

The univariate analysis of the numerical columns **('Flight_Distance_km', 'Flight_Duration_hours', 'Average_Speed_kmph', 'Max_Altitude_m', 'Min_Altitude_m', 'Temperature_C', 'Wind_Speed_kmph', 'Humidity_%', 'Pressure_hPa', 'Visibility_km')** shows no significant outliers and minimal skewness in their distributions. For the categorical columns **('Species', 'Region', 'Habitat', 'Weather_Condition', 'Migration_Reason')**, the count plots show the frequency of each category.
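The visual impression of "few outliers, little skew" can also be checked numerically. The sketch below is one possible approach, not the method used in this notebook: it counts points outside the conventional 1.5 × IQR fences and reports the sample skewness. The helper `outlier_and_skew_summary` and the toy column `x` are hypothetical.

```python
import pandas as pd

def outlier_and_skew_summary(frame, columns):
    """Count 1.5*IQR outliers and compute sample skewness for each column."""
    rows = []
    for col in columns:
        s = frame[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        # Tukey fences: values beyond 1.5 * IQR from the quartiles.
        outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        rows.append({'column': col,
                     'n_outliers': int(outside.sum()),
                     'skewness': s.skew()})
    return pd.DataFrame(rows).set_index('column')

# Hypothetical toy data with one obvious outlier.
toy = pd.DataFrame({'x': [1, 2, 3, 4, 100]})
summary = outlier_and_skew_summary(toy, ['x'])
print(summary)
```

On the real data, the same function could be called with the list of numerical columns named above; outlier counts near zero and skewness close to zero would support the visual reading.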

## Findings from Univariate Analysis:

* The typical flight distance is between 2,000 km and 3,000 km.

* Most birds have a flight duration between 35 and 65 hours.

* The peak migration periods appear to be:

    * Starts: Primarily around March, with significant activity also in January, October, November, and February.

    * Ends: Primarily around April and November, with considerable activity also in March and December.

# Other Considerations


Some records have an 'Interrupted_Reason' even though 'Migration_Success' is 'Successful'. We interpret these cases as follows:

* **Temporary Interruption:** The recorded interruption (e.g., storm, minor injury, lost signal) might have been temporary, allowing the bird to resume and successfully complete its migration.

* **Partial Migration Success:** 'Migration_Success' might be defined based on reaching a general destination or breeding grounds, even if a part of the journey experienced an interruption.


In [None]:
df[['Interrupted_Reason' ,'Migration_Success']].head(15)
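To see how often the two columns overlap in this way, a cross-tabulation can help. The sketch below uses hypothetical toy rows; 'Successful' is the value named above, while 'Failed' and the labels in `has_reason` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows; 'Failed' is an assumed label for unsuccessful migrations.
toy = pd.DataFrame({
    'Interrupted_Reason': ['Storm', None, 'Injury', None],
    'Migration_Success': ['Successful', 'Successful', 'Failed', 'Failed'],
})

# Flag whether a reason was recorded, with readable labels for the table.
has_reason = (
    toy['Interrupted_Reason'].notna()
    .map({True: 'reason recorded', False: 'no reason'})
    .rename('Interrupted_Reason')
)
overlap = pd.crosstab(has_reason, toy['Migration_Success'])
print(overlap)

# Number of rows that are both interrupted (reason present) and successful.
both = toy['Interrupted_Reason'].notna() & toy['Migration_Success'].eq('Successful')
n_both = int(both.sum())
print(n_both)
```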

# Question 1:

## Do certain species migrate in larger flocks?

In [None]:
df[['Migrated_in_Flock', 'Species', 'Flock_Size']].head(15)

## Considerations
### **Note on 'Migrated_in_Flock' and 'Flock_Size':** Although 'Migrated_in_Flock' is marked as 'No' for some birds, their 'Flock_Size' is non-zero. Because this is synthetic data, we allow for the following interpretations:

* **Individual Tracking within a Flock:** Even if the overall Migrated_in_Flock is marked as 'No' for a specific bird, the Flock_Size might still represent the size of a flock it was observed with at some point during its migration, even if it wasn't considered to be actively migrating as part of that flock for the entire journey.

* **Initial or Final Observation:** The Flock_Size recorded might be the size of a group the bird was seen with at the beginning of its migration before separating, or at the end when joining others at a destination. The 'Migrated_in_Flock' status might refer to the majority of the migration journey.

* **"No" Meaning "Not Primarily in a Flock":** The 'No' in Migrated_in_Flock might not strictly mean the bird was entirely alone for its entire migration. It could indicate that its primary mode of migration was solitary, even if it occasionally encountered or briefly traveled with other birds. The Flock_Size might capture these instances.
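The size of this apparent inconsistency can be measured before deciding how to treat it. The sketch below counts rows marked 'No' that still carry a non-zero flock size; the toy values are hypothetical and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows illustrating the mismatch described above.
toy = pd.DataFrame({
    'Migrated_in_Flock': ['Yes', 'No', 'No', 'Yes'],
    'Flock_Size': [120, 45, 0, 300],
})

not_in_flock = toy['Migrated_in_Flock'] == 'No'

# Rows flagged as solitary that nevertheless report a flock size above zero.
inconsistent = toy[not_in_flock & (toy['Flock_Size'] > 0)]
share = len(inconsistent) / not_in_flock.sum()
print(len(inconsistent), share)
```

If the share is large on the real data, restricting the flock-size comparison to rows with 'Migrated_in_Flock' equal to 'Yes', as done in the cells that follow, avoids mixing the two interpretations.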

In [None]:
df.groupby(['Migrated_in_Flock', 'Species'])['Flock_Size'].count()

In [None]:
Migrated_in_Flock_df = df[df['Migrated_in_Flock'] == 'Yes']

In [None]:
sns.kdeplot(x = 'Flock_Size', hue = 'Species', data = Migrated_in_Flock_df)
plt.xlabel('Flock Size')
plt.ylabel('Density')

## Findings:

* Geese display a clear peak in the larger flock size range (around 350-500), suggesting that when Geese migrate in flocks, they typically form larger groups compared to most other species in this selection.
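A numeric summary per species can back up what the density curves suggest. The following is a minimal sketch on hypothetical toy data (the species labels and flock sizes are made up); it ranks species by median flock size among birds that migrated in a flock.

```python
import pandas as pd

# Hypothetical toy rows; species labels and sizes are for illustration only.
toy = pd.DataFrame({
    'Species': ['Goose', 'Goose', 'Goose', 'Stork', 'Stork', 'Stork'],
    'Migrated_in_Flock': ['Yes'] * 6,
    'Flock_Size': [350, 420, 480, 100, 200, 450],
})

flocked = toy[toy['Migrated_in_Flock'] == 'Yes']

# Median is less sensitive than the mean to a few unusually large flocks.
summary = (
    flocked.groupby('Species')['Flock_Size']
    .agg(['count', 'median', 'mean'])
    .sort_values('median', ascending=False)
)
print(summary)
```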

In [None]:
df['Migrated_in_Flock'].value_counts()

# Question 2:

## How does weather impact nesting success?

## Chi-Squared Test

In [None]:
contingency_table = pd.crosstab(df['Weather_Condition'], df['Nesting_Success'])

In [None]:
print(contingency_table)

In [None]:
Chi2, p_value, dof, expected = chi2_contingency(contingency_table)

In [None]:
print('Chi-Squared Statistic:', Chi2)
print('P-Value:', p_value)
print('Degrees of Freedom:', dof)
print('Expected Frequency:', expected)

In [None]:
alpha = 0.05
if p_value < alpha:
    print("\nThe association between Weather Condition and Nesting Success is statistically significant (p < 0.05).")
else:
    print("\nThere is no statistically significant association between Weather_Condition and Nesting_Success (p >= 0.05).")

## ANOVA

In [None]:
numerical_weather = ['Temperature_C', 'Wind_Speed_kmph', 'Humidity_Percent', 'Pressure_hPa', 'Visibility_km']

In [None]:
df.columns

In [None]:
df.rename(columns={'Humidity_%': 'Humidity_Percent'}, inplace=True)  # special characters such as '%' can break statsmodels formulas

In [None]:
for col in numerical_weather:
    print(f"\nANOVA for {col} vs. Nesting Success:")
    formula = f'{col} ~ C(Nesting_Success)'
    model = smf.ols(formula, data=df).fit()
    anova_table = anova_lm(model)
    print(anova_table)

## **Findings**:

### **Findings from Chi-Squared Test (Categorical Weather Condition):**

The Chi-squared test between 'Weather_Condition' and 'Nesting_Success' yielded a p-value of 0.2545. This p-value is greater than the common significance level of 0.05. Therefore, we fail to reject the null hypothesis.

This indicates that there is no statistically significant association between the categorical weather condition (Clear, Foggy, Rainy, Stormy, Windy) and nesting success in the dataset. The observed distribution of successful and unsuccessful nests across the different weather conditions is not significantly different from what we would expect if there were no relationship between them.
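The "expected if there were no relationship" counts come from the row and column totals: each expected cell is (row total × column total) / grand total. The sketch below reproduces them by hand on a hypothetical 3 × 2 table (the counts are made up) and compares them with what `chi2_contingency` returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 3 weather categories x 2 nesting outcomes.
observed = np.array([[10, 20],
                     [30, 40],
                     [25, 25]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts under independence of rows and columns.
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, p_value, dof, expected = chi2_contingency(observed)
print(np.allclose(expected, expected_manual), np.isclose(chi2, chi2_manual), dof)
```

The degrees of freedom equal (rows − 1) × (columns − 1), which is 2 for this toy table.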

### **Findings from ANOVA (Numerical Weather Variables):**
Based on these ANOVA results, for each of the numerical weather variables tested individually, the p-value is higher than 0.05. There is therefore no statistically significant evidence that the mean of any of these variables differed between successful and unsuccessful nesting events in the dataset.

There is no strong statistical evidence to conclude that weather, as represented by these variables, has a significant impact on nesting success.
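For a single grouping factor, the same F-test can be cross-checked with `scipy.stats.f_oneway`, which is a different route than the `statsmodels` formula interface used in this notebook. The sketch below uses hypothetical toy values; the 'Yes'/'No' labels for 'Nesting_Success' are an assumption.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: two outcome groups whose means differ by one degree.
toy = pd.DataFrame({
    'Temperature_C': [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Nesting_Success': ['Yes'] * 5 + ['No'] * 5,
})

# One array of temperatures per nesting outcome.
samples = [g['Temperature_C'].to_numpy()
           for _, g in toy.groupby('Nesting_Success')]

# One-way ANOVA: ratio of between-group to within-group variance.
f_stat, p_value = f_oneway(*samples)
print(f_stat, p_value)
```

Here the between-group and within-group mean squares are equal, so the F statistic is 1 and the p-value is well above 0.05, the same kind of outcome reported above for the real weather variables.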

# Question 3:

## What conditions lead to migration interruptions?

In [None]:
df[df['Migration_Interrupted'] == 'Yes']['Interrupted_Reason'].value_counts()

## Findings:

In summary, based on this data, the primary conditions leading to recorded migration interruptions appear to be adverse weather (storms), physical injury to the birds, and encounters with predators. Loss of tracking signal is also noted as a reason, but less frequently than the other three.
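Relative shares make this ranking of reasons easier to compare than raw counts. A minimal sketch, using hypothetical toy rows (the reason labels mirror those mentioned above, but the counts are made up):

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the dataset.
toy = pd.DataFrame({
    'Migration_Interrupted': ['Yes', 'Yes', 'Yes', 'Yes', 'No'],
    'Interrupted_Reason': ['Storm', 'Storm', 'Injury', 'Predator', None],
})

interrupted = toy[toy['Migration_Interrupted'] == 'Yes']

# normalize=True returns each reason's fraction of interrupted migrations.
shares = interrupted['Interrupted_Reason'].value_counts(normalize=True)
print(shares)
```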

# Geospatial Analysis

In [None]:
geometry_start = [Point(xy) for xy in zip(df['Start_Longitude'], df['Start_Latitude'])]
gdf_start = gpd.GeoDataFrame(df, geometry=geometry_start, crs="EPSG:4326")

In [None]:
geometry_end = [Point(xy) for xy in zip(df['End_Longitude'], df['End_Latitude'])]
gdf_end = gpd.GeoDataFrame(df, geometry=geometry_end, crs="EPSG:4326")

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10))

gdf_start.plot(ax=ax, marker='o', color='blue', markersize=1, alpha=0.3, label='Start Locations')
gdf_end.plot(ax=ax, marker='x', color='red', markersize=1, alpha=0.3, label='End Locations')

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Start and End Locations (Reduced Size & Transparency)")
ax.legend(loc='upper right')
plt.show()