# Multiple Linear Regression

## Predicting Food Prices in Nigeria
- The Economics of Eating: Predicting Food Price Trends in Nigeria
- This dataset contains food price data for Nigeria, sourced from the World Food Programme Price Database. The database covers foods such as maize, rice, beans, fish, and sugar for 98 countries and some 3000 markets. It is updated weekly, but most of the data is monthly. The data goes back as far as 1992 for a few countries, although many countries started reporting in 2003 or later.

## Exploratory Data Analysis: Unveiling Patterns and Insights

    1. Introduction
        Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.

    2. Objectives of Exploratory Data Analysis
        1. Understand Data Characteristics:
            Gain insights into the distribution, central tendency, and variability of the data.
            Identify the presence of missing values, outliers, and anomalies.

        2. Explore Relationships:
            Examine correlations and dependencies between different variables.
            Uncover potential patterns and trends within the dataset.
        
        3. Visualize Data Distributions:
            Utilize graphical representations to visualize the distribution of data.
            Choose appropriate plots such as histograms, box plots, and scatter plots.
            
        4. Identify Patterns and Anomalies:
            Uncover hidden patterns that may not be apparent in raw data.
            Detect outliers and anomalies that could impact analysis outcomes.
            
            
    3. Techniques and Tools
        1. Descriptive Statistics:
            Calculate measures such as mean, median, and standard deviation.
            Utilize summary statistics to provide an overview of the dataset.

        2. Data Visualization:
            Employ graphical representations like histograms, box plots, and scatter plots.
            Create visualizations to illustrate trends, patterns, and relationships.

        3. Correlation Analysis:
            Use correlation matrices to quantify the relationships between variables.
            Identify strong positive/negative correlations and potential multicollinearity.

        4. Outlier Detection:
            Apply statistical methods or visual inspection to identify outliers.
            Assess the impact of outliers on the analysis and consider appropriate handling.

    4. Steps in Exploratory Data Analysis
        1. Data Collection:
            Gather the raw dataset from reliable sources.

        2. Data Cleaning:
            Handle missing values, duplicate entries, and inconsistencies.
            Ensure data is in a suitable format for analysis.

        3. Descriptive Statistics:
            Compute basic statistics to describe the central tendency and dispersion.

        4. Visualization:
            Generate visualizations to explore data distributions and relationships.

        5. Correlation Analysis:
            Investigate correlations between variables.

        6. Outlier Detection:
            Identify and analyze outliers to understand their impact.

    5. Case Study: Applying EDA to Real-World Data
        Provide a practical example where EDA is applied to a specific dataset, showcasing the step-by-step process and the insights gained.

    6. Conclusion
        Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.

    7. References
        Include references to any tools, libraries, or methodologies used in the EDA process.
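
The cleaning, descriptive statistics, correlation, and outlier detection steps outlined above can be sketched on a small synthetic table. This is a minimal illustration, not the analysis performed in this notebook: the `toy` frame, its column names, and its values are made up, and the 1.5 × IQR rule is only one of several possible outlier criteria.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a price table (values are made up)
rng = np.random.default_rng(0)
price = np.append(rng.normal(100, 10, 49), 400.0)  # one planted outlier
toy = pd.DataFrame({
    "price": price,
    "usd_price": price / 400 + rng.normal(0, 0.01, 50),
})
toy.loc[3, "usd_price"] = np.nan                          # one planted missing value
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)  # one planted duplicate row

# Data cleaning: count missing values and duplicates, then remove them
n_missing = int(toy.isna().sum().sum())
n_duplicates = int(toy.duplicated().sum())
clean = toy.drop_duplicates().dropna().reset_index(drop=True)

# Descriptive statistics: count, mean, std, min, quartiles, max per column
summary = clean.describe()

# Correlation analysis: Pearson correlation matrix
corr = clean.corr()

# Outlier detection: flag prices outside 1.5 * IQR of the quartiles
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = clean[(clean["price"] < q1 - 1.5 * iqr) | (clean["price"] > q3 + 1.5 * iqr)]

print(f"missing: {n_missing}, duplicates: {n_duplicates}, rows kept: {len(clean)}")
print(summary)
print(corr)
print(outliers)
```

Whether flagged rows should be dropped, capped, or kept depends on the data. In a price series, an extreme value may be a genuine price spike rather than an error.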


### Import Libraries

In [None]:
import pandas as pd
import numpy as np

# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm # color map
from mpl_toolkits.mplot3d.axes3d import Axes3D
import seaborn as sns
import plotly.express as px

# Documentation
import handcalcs.render

# Symbolic math
from sympy import Sum, symbols, Indexed, lambdify, diff

# Preprocessing, modelling and evaluation
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
# Path
data_path = './Data/'

### Import Dataset

In [None]:
raw_data = pd.read_csv(data_path+"wfp_food_prices_nga.csv",  low_memory=False).reset_index(drop=True)
raw_data.shape

In [None]:
raw_data

In [None]:
column_info = raw_data.iloc[0]
column_info

In [None]:
# Drop the column-description row inspected above so that only observations remain
raw_data = raw_data.drop(0).reset_index(drop=True)
raw_data

In [None]:
# Rename the columns
raw_data.rename(columns={ 'date': 'Date', 'admin1': 'Admin1', 'admin2': 'Admin2', 'market': 'Market', 'latitude': 'Latitude', 'longitude': 'Longitude', 'category': 'Category', 
                     'commodity': 'Commodity', 'unit': 'Unit', 'priceflag': 'Price_Flag', 'pricetype': 'Price_Type', 'currency': 'Currency', 'price': 'Price', 'usdprice':'USD_Price' }, inplace=True)
raw_data.columns

In [None]:
raw_data.info()

In [None]:
raw_data['Date'] = raw_data['Date'].astype('datetime64[ns]')
raw_data['Latitude'] = raw_data['Latitude'].astype(float)
raw_data['Longitude'] = raw_data['Longitude'].astype(float)
raw_data['Price'] = raw_data['Price'].astype(float)
raw_data['USD_Price'] = raw_data['USD_Price'].astype(float)

In [None]:
raw_data

In [None]:
raw_data.describe()

In [None]:
raw_data.nunique()

### Checking for Missing Values

In [None]:
raw_data.isnull().sum()

In [None]:
for column in raw_data.columns:
    if column not in ['Date', 'Latitude', 'Longitude', 'Price', 'USD_Price']:
        print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
        print(raw_data[column].unique())
        print("--------------------------------------------------------------------------------------------------------------")
        

In [None]:
raw_data.Category.value_counts()

In [None]:
raw_data.Commodity.value_counts()

In [None]:
raw_data['Unit'].value_counts()

In [None]:
raw_data['Price_Flag'].value_counts()

In [None]:
raw_data['Price_Type'].value_counts()

In [None]:
# Removing the Currency column, which is not needed for the analysis
raw_data.drop('Currency',  axis=1, inplace=True)

### Encoding categorical data

Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. There are several common techniques for encoding categorical data:

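1. Label Encoding:
    - Assigns a unique integer to each category.
    - The integers follow the sorted order of the category values, not a meaningful ranking, so a model may read a false ordering into nominal features.
    - Sklearn provides LabelEncoder for this purpose; it is intended for target labels rather than input features.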
<br></br>
2. One-Hot Encoding:
    - Creates binary columns for each category and represents the presence of a category with a 1.
    - Suitable for nominal data where there is no inherent order.
<br></br>
3. Ordinal Encoding:
    - Manually assign numerical values based on the order of categories.
    - Useful when there is an inherent order among categories.
<br></br>   
4. Binary Encoding:
    - Converts categories into binary code.
    - Reduces the number of columns compared to one-hot encoding.
<br></br>
5. Hashing Encoding:
    - Converts categories into a fixed-size hash, useful when dealing with high cardinality.

In [None]:
pd.get_dummies(raw_data['Category'])

In [None]:
pd.get_dummies(raw_data['Price_Flag'])

In [None]:
pd.get_dummies(raw_data['Price_Type'])

In [None]:
pd.get_dummies(raw_data['Unit'])

In [None]:
# These rows are separated out as test data because their prices have to be predicted
test_data = raw_data[raw_data['Price_Flag'] == 'forecast'].reset_index(drop=True)
test_data

In [None]:
raw_data = raw_data[raw_data['Price_Flag'] != 'forecast'].reset_index(drop=True)
raw_data

### One-Hot Encoding

In [None]:
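# One fitted encoder per categorical column, so that the training and test frames
# are transformed with the same categories and end up with the same columns
encoders = {}

In [None]:
def encoding_categorical_data(df, column, fit=True):
    if fit:
        # handle_unknown='ignore' encodes categories unseen during fitting as all zeros
        encoders[column] = OneHotEncoder(handle_unknown='ignore').fit(df[[column]])
    encoder = encoders[column]
    encoded = pd.DataFrame(encoder.transform(df[[column]]).toarray(),
                           columns=encoder.get_feature_names_out([column]),
                           index=df.index).astype(int)
    # Replacing the original column with its one-hot columns
    return pd.concat([df.drop(column, axis=1), encoded], axis=1)

In [None]:
categorical_columns = ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']

data = raw_data.copy()
for col in categorical_columns:
    data = encoding_categorical_data(data, col)

# Reusing the encoders fitted on the training data for the test data
for col in categorical_columns:
    test_data = encoding_categorical_data(test_data, col, fit=False)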

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
test_data.info()

In [None]:
data

In [None]:
data = data.drop(['Date', 'Admin1', 'Admin2', 'Market'], axis=1)

### Data Exploration

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))  # You can adjust the width and height as needed

# Plot from raw_data: the Category column was one-hot encoded out of `data`
# Categories ordered by number of records (slice the index to keep only the top few)
top_categories = raw_data['Category'].value_counts().index

# Choose a color palette with one color per category
palette = sns.color_palette("Set1", len(top_categories))

# Filter the data to include only the selected categories
filtered_data = raw_data[raw_data['Category'].isin(top_categories)]

# Creating the bar plot with different colors for each category;
# order and hue_order keep the bar colors in step with the legend
sns.barplot(x='Category', y='USD_Price', hue='Category', data=filtered_data, palette=palette,
            order=top_categories, hue_order=top_categories)

# Adding titles and labels
plt.title('Bar Plot of USD Price by Category')
plt.xlabel('Category')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Category')

# Show the plot
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(10, 18))  # Swapping width and height for a horizontal plot

# Plot from raw_data: the Commodity column was one-hot encoded out of `data`
# Commodities ordered by number of records
top_commodities = raw_data['Commodity'].value_counts().index

# One color per commodity, shared by the bars and the legend
palette = sns.color_palette("husl", len(top_commodities))

# Filter the data to include only the selected commodities
filtered_data = raw_data[raw_data['Commodity'].isin(top_commodities)]

# Creating the horizontal bar plot with different colors for each commodity
sns.barplot(x='USD_Price', y='Commodity', hue='Commodity', data=filtered_data, palette=palette,
            order=top_commodities, hue_order=top_commodities)

# Adding titles and labels
plt.title('Plot of USD Price by Commodity')
plt.xlabel('USD Price')
plt.ylabel('Commodity')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=palette[i], markersize=10) for i, _ in enumerate(top_commodities)]
plt.legend(legend_handles, top_commodities, title='Commodity')

# Show the plot
plt.show()

In [None]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(14, 5))

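# Plot from raw_data: the Price_Type column was one-hot encoded out of `data`
# Choose a color palette with one color per price type
palette = sns.color_palette("Set1", raw_data['Price_Type'].nunique())

# Price types ordered by number of records
top_price_type = raw_data['Price_Type'].value_counts().index

# Filter the data to include only the selected price types
filtered_data = raw_data[raw_data['Price_Type'].isin(top_price_type)]

# Creating the bar plot with different colors for each Price Type;
# order and hue_order keep the bar colors in step with the legend
ax = sns.barplot(x='Price_Type', y='USD_Price', hue='Price_Type', data=filtered_data, palette=palette,
                 order=top_price_type, hue_order=top_price_type)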

# Adding titles and labels
plt.title('USD Price by Price Type')
plt.xlabel('Price Type')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_price_type)]
plt.legend(legend_handles, top_price_type, title='Price Type')

# Adding annotations on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Show the plot
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a density histogram with 10 bins
sns.histplot(data["Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Density')

# Show the plot
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a density histogram with 10 bins
sns.histplot(data["USD_Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Density')

# Show the plot
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

sns.kdeplot(data["USD_Price"])

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Density')

# Show the plot
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))
# Plot from raw_data: the Commodity column was one-hot encoded out of `data`
sns.scatterplot(x=raw_data['Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of NGN Price vs Commodity')
plt.xlabel('NGN Price')
plt.ylabel('Commodity')
plt.show()

In [None]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))
# Plot from raw_data: the Commodity column was one-hot encoded out of `data`
sns.scatterplot(x=raw_data['USD_Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of USD Price vs Commodity')
plt.xlabel('USD Price')
plt.ylabel('Commodity')
plt.show()

In [None]:
sns.pairplot(data)
plt.show()

In [None]:
sns.pairplot(data, kind = 'reg' , plot_kws = {'line_kws':{'color': 'red'}})
plt.show()

In [None]:
co_relation = data.corr()

In [None]:
plt.figure(figsize=(18, 8))  # Wide figure so the heatmap labels stay readable
sns.heatmap(co_relation, xticklabels=co_relation.columns, yticklabels=co_relation.columns, annot=True)

### Data Splitting