<a href="https://colab.research.google.com/github/RohanC07/Programming_for_DataScience/blob/main/Programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This dataset contains the **All India Consumer Price Index (CPI)** for rural and urban areas up to September 2014. The CPI measures changes over time in the average price of a basket of consumer goods and services, which makes it an important indicator of inflation in the country. The dataset covers various regions, so it can be used to analyze price trends and inflationary pressures in the rural and urban sectors, to compare regional price disparities, and to evaluate how inflation affects different population segments.
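
To make the link between the CPI and inflation concrete, the sketch below computes a year-over-year inflation rate as the percentage change of the index between consecutive years. The index values are hypothetical and chosen only for easy arithmetic; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical annual average CPI values, for illustration only (not from the dataset)
cpi = pd.Series([100.0, 110.0, 121.0], index=[2011, 2012, 2013], name="General index")

# Year-over-year inflation in percent: (CPI_t - CPI_{t-1}) / CPI_{t-1} * 100
inflation_pct = cpi.pct_change() * 100

# The first year has no previous value, so its inflation rate is NaN
print(inflation_pct.round(2))
```

With these made-up numbers the index rises by the same proportion each year, so both computed rates are equal even though the absolute increase in the index grows.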


**(1)** **Mounting Google Drive in Google Colab to access the CSV file**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**(2)** Importing the necessary libraries (**pandas**, **NumPy**, **pyplot**, **seaborn**, and **Plotly Express**) and loading the CSV file with pandas

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import os
import glob
from scipy.stats import ttest_ind

# Dataset description
# Load the dataset (replace the path below with the location of your own copy)
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Assesment/Programming for Data Science/datafile.csv")
data


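**(3) Visualisation** helps in understanding and interpreting the data. The cells below clean the `Year` and `Housing` columns, then plot the yearly averages of the indicators and the relationship between the general index and the `Health` index for each sector.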

In [None]:
# Basic Summary Statistics
def basic_summary_statistics(data):
    """Generate basic summary statistics."""
    print("\nBasic Summary Statistics:")
    print(data.describe())  # For numerical data
    print("\nCategorical Data Frequencies:")
    for col in data.select_dtypes(include=['object']).columns:
        print(f"{col} - \n{data[col].value_counts()}")

# Ensure the "Year" column is strictly numeric
data["Year"] = pd.to_numeric(data["Year"], errors="coerce")
# Drop rows where "Year" is missing or invalid; copy to avoid SettingWithCopyWarning
data = data.dropna(subset=["Year"]).copy()
# Ensure "Year" is an integer
data["Year"] = data["Year"].astype(int)
# Clean the "Housing" column to ensure it contains only numeric values
data["Housing"] = pd.to_numeric(data["Housing"], errors="coerce")
# Impute missing "Housing" values with the mean
housing_mean = data["Housing"].mean()
# Assign the result back; inplace fillna on a column selection is deprecated in recent pandas
data["Housing"] = data["Housing"].fillna(housing_mean)
# Select numeric columns for analysis (excluding "Year")
numeric_columns = data.select_dtypes(include=["number"]).columns
numeric_columns_without_year = [col for col in numeric_columns if col != "Year"]
# Group by "Year" and calculate mean for the numeric columns (excluding "Year")
grouped_data = data.groupby("Year")[numeric_columns_without_year].mean().reset_index()

# Plot the trends for the top 10 indicators (ranked by their mean across years)
plt.figure(figsize=(14, 8))
# Rank only the indicator columns so that "Year" cannot take one of the top 10 slots
top_10_indicators = grouped_data[numeric_columns_without_year].mean().sort_values(ascending=False).head(10).index.tolist()

for indicator in top_10_indicators:
    plt.plot(grouped_data["Year"], grouped_data[indicator], marker="o", label=indicator)

plt.title("Top 10 Indicators Over Years", fontsize=16)
plt.xlabel("Year", fontsize=14)
plt.ylabel("Average Value", fontsize=14)
plt.legend(title="Indicators", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.grid(True)
plt.tight_layout()
plt.show()
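
As a minimal sketch of the cleaning pattern used in the cell above, the example below applies the same steps to a small toy frame: coerce non-numeric strings to `NaN`, drop rows without a valid year, and fill the remaining gaps with the column mean. The frame `toy` and all of its values are hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical values; "NA" and "-" stand in for non-numeric entries
toy = pd.DataFrame({
    "Year": ["2011", "2012", "NA", "2013"],
    "Housing": ["100.0", "-", "105.0", "110.0"],
})

# Non-numeric strings become NaN instead of raising an error
toy["Year"] = pd.to_numeric(toy["Year"], errors="coerce")

# Drop rows without a valid year; .copy() keeps later assignments warning-free
toy = toy.dropna(subset=["Year"]).copy()
toy["Year"] = toy["Year"].astype(int)

# Coerce "Housing", then fill the gaps with the column mean by assignment
toy["Housing"] = pd.to_numeric(toy["Housing"], errors="coerce")
toy["Housing"] = toy["Housing"].fillna(toy["Housing"].mean())

print(toy)
```

Note that the mean is computed after the invalid-year row has been dropped, so that row's `Housing` value does not contribute to the imputed value. Mean imputation is only one option; it keeps the column average unchanged but reduces its variance.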

In [None]:
# Ensure the "Year" column is strictly numeric
data["Year"] = pd.to_numeric(data["Year"], errors="coerce")

# Drop rows where "Year" is missing or invalid; copy to avoid SettingWithCopyWarning
data = data.dropna(subset=["Year"]).copy()

# Ensure "Year" is an integer
data["Year"] = data["Year"].astype(int)

# Clean the "Housing" column to ensure it contains only numeric values
data["Housing"] = pd.to_numeric(data["Housing"], errors="coerce")

# Optionally: impute missing "Housing" values with the mean
housing_mean = data["Housing"].mean()
# Assign the result back; inplace fillna on a column selection is deprecated in recent pandas
data["Housing"] = data["Housing"].fillna(housing_mean)

# Filter data for Urban and Rural sectors
urban_rural_data = data[data["Sector"].isin(["Urban", "Rural"])]

# Group by Year and Sector to calculate average CPI and Health
urban_rural_cpi_health = urban_rural_data[["Year", "Sector", "General index", "Health"]]
urban_rural_cpi_health_grouped = urban_rural_cpi_health.groupby(["Year", "Sector"]).mean().reset_index()

# Scatter plot with regression lines for CPI vs. Health for Urban and Rural
plt.figure(figsize=(14, 8))

# Draw the points and a dashed regression line for each sector in its own color
for sector, color in [("Urban", "blue"), ("Rural", "red")]:
    sector_data = urban_rural_cpi_health_grouped[urban_rural_cpi_health_grouped["Sector"] == sector]
    sns.scatterplot(data=sector_data,
                    x="General index", y="Health", color=color, label=sector)
    sns.regplot(data=sector_data,
                x="General index", y="Health", scatter=False, color=color, line_kws={"linestyle": "--"})

# Titles and labels
plt.title("CPI (General Index) vs. Health: Urban vs Rural", fontsize=16)
plt.xlabel("CPI (General Index)", fontsize=14)
plt.ylabel("Health", fontsize=14)
plt.legend(title="Sector", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.grid(True)
plt.tight_layout()
plt.show()
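
The dashed lines in the plot above are linear fits of `Health` on the general index. Assuming an ordinary least-squares fit, which is what `sns.regplot` draws by default, the slope and intercept of such a line can be approximated with `np.polyfit`. The sketch below uses hypothetical yearly averages for a single sector, constructed to lie exactly on a straight line.

```python
import numpy as np

# Hypothetical yearly averages for one sector (illustration only, not from the dataset)
general_index = np.array([100.0, 110.0, 120.0, 130.0])
health = np.array([98.0, 106.0, 114.0, 122.0])

# Degree-1 least-squares fit: health ≈ slope * general_index + intercept
slope, intercept = np.polyfit(general_index, health, deg=1)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A slope below 1 would mean that the `Health` index rises more slowly than the general index over the period; comparing the slopes of the two sectors gives a numeric counterpart to the visual comparison.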

**(4)** Using **box plots** to understand the distribution, median, and potential outliers of numerical columns across the categories of a categorical column (here, `General index` by `Sector`)

In [None]:
# Box plot of the general index distribution by sector
plt.figure(figsize=(8, 6))

# Box plot with the Set2 color palette
sns.boxplot(x='Sector', y='General index', data=data,
            palette="Set2", width=0.6, fliersize=7, linewidth=2)

# Adding title and labels with enhanced styling
plt.title('General Index Distribution by Sector', fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Sector', fontsize=14)
plt.ylabel('General Index', fontsize=14)

# Adjust the y-axis gridlines for easier readability
plt.grid(True, axis='y', linestyle='--', alpha=0.6)

# Keep x-axis labels horizontal and centered under each box
plt.xticks(rotation=0, ha='center')

# Ensure the layout is tight and polished
plt.tight_layout()
plt.show()
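
Points drawn beyond the whiskers of a box plot are, with the default settings, values that lie more than 1.5 times the interquartile range (IQR) outside the box. The sketch below reproduces that rule by hand on a handful of hypothetical index values, which can help when the outliers need to be listed rather than just seen.

```python
import pandas as pd

# Hypothetical index values for a single sector (illustration only)
values = pd.Series([100, 102, 104, 106, 108, 110, 150], dtype=float)

# Quartiles and interquartile range (the height of the box)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; values outside them count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={values.median()}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```

On the real data, the same calculation could be applied per sector, for example inside a `groupby("Sector")`, to list the observations that the plot shows as individual points.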

**(5) Advanced analysis** uncovers deeper insights in the data. A correlation matrix identifies linear relationships between the numerical variables, giving a clearer picture of how the features move together, and a trend plot shows how the general index changes over time.

In [None]:
# Advanced Analysis
def advanced_analysis(data):
    """Perform advanced data analysis, including correlation matrix and trend analysis."""
    # Select every numeric dtype, not only float64 and int64
    numerical_columns = data.select_dtypes(include="number").columns
    data_numerical = data[numerical_columns]

    # Correlation matrix
    correlation_matrix = data_numerical.corr()
    print("Correlation Matrix:")
    print(correlation_matrix)

    # Plot Correlation Heatmap
    plt.figure(figsize=(20, 10))
    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
    sns.heatmap(
        correlation_matrix,
        mask=mask,
        annot=True,
        fmt='.2f',
        cmap='coolwarm',
        linewidths=0.5,
        annot_kws={"size": 10},
        cbar_kws={'shrink': 0.8, 'label': 'Correlation Coefficient'}
    )
    plt.title('Correlation Heatmap', fontsize=18, pad=20)
    plt.xticks(fontsize=12, rotation=90, ha='right')
    plt.yticks(fontsize=12)
    plt.tight_layout()
    plt.show()

    # Trend Analysis (if applicable)
    if 'Year' in data.columns and 'General index' in data.columns:
        plt.figure(figsize=(10, 6))
        data.groupby('Year')['General index'].mean().plot(label='General Index Trend')
        plt.title('General Index Over Time', fontsize=16)
        plt.xlabel('Year', fontsize=14)
        plt.ylabel('General Index', fontsize=14)
        plt.grid(True)
        plt.legend()
        plt.tight_layout()
        plt.show()

# Main Execution
if __name__ == "__main__":
    # Basic Summary Statistics
    basic_summary_statistics(data)

    # Advanced Analysis
    advanced_analysis(data)
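
To show what the correlation matrix and the heatmap mask contain, the sketch below computes Pearson correlations on a small toy frame and hides the diagonal and upper triangle, which only repeat the information in the lower triangle. All values are hypothetical, and the `Fuel` column name is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values (illustration only); "Fuel" is an assumed column name
toy = pd.DataFrame({
    "General index": [100.0, 110.0, 120.0, 130.0],
    "Health": [98.0, 106.0, 114.0, 122.0],
    "Fuel": [130.0, 120.0, 110.0, 100.0],
})

# Pairwise Pearson correlation coefficients, each between -1 and 1
corr = toy.corr()

# True on the diagonal and above: the cells the heatmap leaves blank
mask = np.triu(np.ones_like(corr, dtype=bool))

# Keep only the lower triangle; masked cells become NaN
lower_only = corr.mask(mask)

print(lower_only.round(2))
```

Here `Health` rises in step with the general index while `Fuel` falls in step with it, so the two coefficients sit at opposite ends of the scale. A strong correlation only shows that two series move together; it does not show that one drives the other.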