<a href="https://www.kaggle.com/code/tasbihothman/improving-egypt-s-gdp-time-series-and-regression?scriptVersionId=144074696" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<center><img src="https://img.freepik.com/free-vector/gross-domestic-product-concept-growth-arrow-chart-with-globe-stacks-money-happy-tiny-professional_74855-10698.jpg?w=1060&t=st=1695232972~exp=1695233572~hmac=60906160de9353e294aec51c121f5de0908163bd74852300d1608958b3719642" width=600></center><br>

**Table of Contents** <br>

1.[-Preprocessing](#pre)<br>
>1.1[-Missing values](#missing)<br>
>1.2[-Outliers](#out)<br>
>1.3[-Label Encoding](#le)<br>
>1.4[-Train Test Split](#split)<br>
>1.5[-Scaling](#scale)<br>
>1.6[-VIF and dimensionality reduction](#vif)<br>



2.[-Analysis](#analysis)<br>
>2.1[-GDP and phone numbers](#phones)<br>
>2.2[-GDP in different countries](#diffgdp)<br>
>2.3[-GDP and regions](#region)<br>
>2.4[-Effect of Mortality on GDP](#mortality)<br>
>2.5[-Population Increase Rate](#PopInc)<br>
>2.6[-Each sector contribution in the GDP](#sector)<br>
>2.7[-Effect of Geography on GDP ](#geo)<br>
>2.8[-Population effect on GDP per capita](#pop)<br>
>2.9[-Total GDP analysis](#total)<br>
>2.10[-The Effect of Education on GDP](#edu)<br>
>2.11[-The Effect of Migration on GDP](#mig)<br>

3.[-Time Series](#time)<br>
>3.1[-Missing values](#TimeMiss)<br>
>3.2[-Time Analysis](#Tanalysis)<br>
>3.3[-ARIMA Modelling](#arima)<br>

4.[-Regression Modelling](#reg)<br>
>4.1[-Linear Regression](#lin)<br>
>4.2[-Ridge Regression](#ridge)<br>
>4.3[-Lasso Regression](#lasso)<br>
>4.4[-KNN Regressor](#knn)<br>
>4.5[-Auto ML](#auto)<br>
>>4.5.1[-ML Jar](#jar)<br>
>>4.5.2[-Pycaret](#caret)<br>



In [None]:
!pip install country-converter

In [None]:
!pip install pmdarima

In [None]:
!pip install mljar-supervised

In [None]:
!pip install pycaret

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# importing libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
import plotly.graph_objs as go
import country_converter as coco

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse
from statsmodels.tsa.arima.model import ARIMA
import pmdarima as pm
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 8, 6
from statsmodels.graphics.tsaplots import plot_predict

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from supervised.automl import AutoML
from pycaret.regression import *

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Data initialization

In [None]:
df =pd.read_csv('/kaggle/input/countries-of-the-world/countries of the world.csv', sep=',', encoding = 'utf-8')
time= pd.read_csv('/kaggle/input/countries-gdp-2012-to-2021/GDP.csv', sep=',', encoding = 'utf-8')

In [None]:
df

In [None]:
df.columns.to_list()

In [None]:
df.info()

In [None]:
df[['Country', 'Region', 'Climate']].sample(25)

In [None]:
df[['Arable (%)','Crops (%)','Other (%)','Coastline (coast/area ratio)']]

In [None]:
df.describe()

In [None]:
df.describe(include = 'O')

In [None]:
df[df.duplicated()]

***
# Preprocessing <a class='anchor' id='pre'></a>

### Convert the object columns that hold numeric data into numeric types for better analysis

In [None]:
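# Climate is still stored as text at this point (e.g. '1,5'), so the mapping keys must be strings;
# numeric keys would not match anything. Re-code the six categories as consecutive integers 1-6.
climate_mapping = {'1': '1', '1,5': '2', '2': '3', '2,5': '4', '3': '5', '4': '6'}
df['Climate'] = df['Climate'].replace(climate_mapping)
df.Climate.value_counts().sort_index()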

In [None]:
cols =['Climate', 'Service', 'Industry', 'Agriculture', 'Deathrate', 'Birthrate',
       'Other (%)', 'Crops (%)', 'Arable (%)', 'Phones (per 1000)', 'Literacy (%)',
       'Infant mortality (per 1000 births)', 'Net migration', 'Coastline (coast/area ratio)',
       'Pop. Density (per sq. mi.)']
for col in cols:
    df[col] = df[col].str.replace(',', '.').astype(float)
df['Population'] =df['Population'] .astype(int)

In [None]:
df.info()

In [None]:
df.isnull().sum()

## Handling missing values <a class='anchor' id='missing'></a>

In [None]:
msno.bar(df, figsize=(12,5))

### GDP Null values

In [None]:
df[df['GDP ($ per capita)'].isnull()]
# The figure has not been updated since 2007, when it was $2500; that is equivalent to $2955 after adjusting for inflation

In [None]:
df['GDP ($ per capita)'] =df['GDP ($ per capita)'].fillna(2955)

In [None]:
df[df['GDP ($ per capita)'].isnull()==True]

### Climate null values

We can reasonably assume that the climate, literacy rate, and similar attributes of a country resemble those of its neighbours, so we group the countries by region and fill the missing values with the regional mean.
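
As a minimal sketch of this region-based fill, the toy frame below (hypothetical values, not taken from the dataset) replaces each missing value with the mean of its region. It uses `groupby(...).transform('mean')`, which is one possible alternative to building an intermediate dictionary and mapping it back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two regions, one missing literacy value in each.
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B"],
    "Literacy (%)": [90.0, 80.0, np.nan, 50.0, np.nan],
})

# Mean of the observed values within each region, aligned to every row.
region_mean = toy.groupby("Region")["Literacy (%)"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
toy["Literacy (%)"] = toy["Literacy (%)"].fillna(region_mean)
print(toy)
```

A region whose values are all missing would still be left empty by this approach, so a second fallback may be needed in that case.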

In [None]:
region_mean_climate = df.groupby(['Region']).Climate.mean().round(0).to_dict()
# Assign the result back instead of using inplace=True on a column selection
df['Climate'] = df['Climate'].fillna(df['Region'].map(region_mean_climate))

### Literacy null values

In [None]:
# fill literacy manually
df[df['Literacy (%)'].isnull()]['Country']

In [None]:
df.loc[25,'Literacy (%)' ] = 87.9
df.loc[66,'Literacy (%)'] = 99 
df.loc[74,'Literacy (%)'] = 96.92
df.loc[78,'Literacy (%)'] = 80
df.loc[80,'Literacy (%)'] = 100
df.loc[85,'Literacy (%)'] = 100
df.loc[99,'Literacy (%)'] = np.nan # couldn't find any info
df.loc[104,'Literacy (%)'] = np.nan # couldn't find any info
df.loc[108,'Literacy (%)'] = 89.3
df.loc[123,'Literacy (%)'] = 98.4
df.loc[134,'Literacy (%)'] = 90
df.loc[144,'Literacy (%)'] = 96.5
df.loc[185,'Literacy (%)'] = 82.7
df.loc[187,'Literacy (%)'] = 76.6
df.loc[209,'Literacy (%)'] = 99
df.loc[220,'Literacy (%)'] = 97.8
df.loc[222,'Literacy (%)'] = 96.92
df.loc[223,'Literacy (%)'] = 50

In [None]:
region_mean_Literacy = df.groupby(['Region'])['Literacy (%)'].mean().round(0).to_dict()
# Assign the result back instead of using inplace=True on a column selection
df['Literacy (%)'] = df['Literacy (%)'].fillna(df['Region'].map(region_mean_Literacy))

In [None]:
df[df['Literacy (%)'].isnull()==True]

### Arable, crops and other missing values

In [None]:
#with missing
category_labels = ['Arable', 'Crops', 'Other']
category_values = [df['Arable (%)'].sum(), df['Crops (%)'].sum(), df['Other (%)'].sum()] 

plt.figure(figsize=(6, 6))  
plt.pie(category_values, labels=category_labels, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Data')

plt.axis()  
plt.show()

In [None]:
region_mean_Arable = df.groupby(['Region'])['Arable (%)'].mean().round(0).to_dict()
df['Arable (%)'] = df['Arable (%)'].fillna(df['Region'].map(region_mean_Arable))

region_mean_Crops = df.groupby(['Region'])['Crops (%)'].mean().round(0).to_dict()
df['Crops (%)'] = df['Crops (%)'].fillna(df['Region'].map(region_mean_Crops))

# since arable, crops and other are complementary (they sum to 100), we can fill 'Other' by subtraction
other_fill = 100 - df['Crops (%)'] - df['Arable (%)']
df['Other (%)'] = df['Other (%)'].fillna(other_fill)

In [None]:
#without missing
category_labels = ['Arable', 'Crops', 'Other']
category_values = [df['Arable (%)'].sum(), df['Crops (%)'].sum(), df['Other (%)'].sum()] 

plt.figure(figsize=(6, 6))  
plt.pie(category_values, labels=category_labels, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Data')

plt.axis()  
plt.show()

# the ratios are the same before and after filling

In [None]:
msno.matrix(df)

### Agriculture, industry and service null values

In [None]:
#with missing
category_labels = ['Agriculture', 'Industry', 'Service']
category_values = [df['Agriculture' ].sum(), df['Industry'].sum(), df['Service'].sum()] 

plt.figure(figsize=(6, 6))  
plt.pie(category_values, labels=category_labels, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Data')

plt.axis()  
plt.show()

In [None]:
#with missing
sns.kdeplot(x=df['Agriculture'], color='green')
sns.kdeplot(x=df['Industry'], color='silver')
sns.kdeplot(x=df['Service'], color='gold')

In [None]:
data = df[['Agriculture', 'Industry', 'Service']]
# Function to calculate the missing column
def calculate_missing_column(row):
    if not np.isnan(row['Agriculture']) and not np.isnan(row['Industry']):
        row['Service'] = 1 - row['Agriculture'] - row['Industry']
    elif not np.isnan(row['Agriculture']) and not np.isnan(row['Service']):
        row['Industry'] = 1 - row['Agriculture'] - row['Service']
    elif not np.isnan(row['Industry']) and not np.isnan(row['Service']):
        row['Agriculture'] = 1 - row['Industry'] - row['Service']
    return row

# Apply the function to calculate missing columns
data = data.apply(calculate_missing_column, axis=1)

# Use KNNImputer for remaining missing values
imputer = KNNImputer(n_neighbors=2)
data = imputer.fit_transform(data)

# Convert the result back to a DataFrame
data = pd.DataFrame(data, columns=['Agriculture', 'Industry', 'Service'])

In [None]:
#without missing
category_labels = ['Agriculture', 'Industry', 'Service']
category_values = [data['Agriculture' ].sum(), data['Industry'].sum(), data['Service'].sum()] 

plt.figure(figsize=(6, 6))  
plt.pie(category_values, labels=category_labels, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Data')

plt.axis()  
plt.show()

In [None]:
#without missing
sns.kdeplot(x=data['Agriculture'], color='green')
sns.kdeplot(x=data['Industry'], color='silver')
sns.kdeplot(x=data['Service'], color='gold')

The graphs have the same distribution before and after, so the KNN imputer did a good job of imputing the missing values.
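
The complement rule applied before the KNN step can also be written without a row-wise `apply`. The sketch below uses made-up shares and assumes, as the notebook does, that the three sector shares sum to 1. Unlike the row-wise function, it only touches cells that are actually missing, and rows with more than one missing share are left for the imputer.

```python
import numpy as np
import pandas as pd

sectors = ["Agriculture", "Industry", "Service"]

# Hypothetical shares; the last row is missing two values.
toy = pd.DataFrame({
    "Agriculture": [0.10, np.nan, 0.30, np.nan],
    "Industry":    [0.30, 0.40, np.nan, np.nan],
    "Service":     [np.nan, 0.50, 0.45, 0.60],
})

# Rows with exactly one missing share can be completed exactly.
one_missing = toy[sectors].isna().sum(axis=1) == 1
remainder = 1 - toy[sectors].sum(axis=1)  # sum() skips NaN values

for col in sectors:
    fill_here = one_missing & toy[col].isna()
    toy.loc[fill_here, col] = remainder[fill_here]

print(toy)
```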

In [None]:
df[['Agriculture', 'Industry', 'Service']]=data[['Agriculture', 'Industry', 'Service']]

### Handling the missing values of the remaining columns

In [None]:
sns.heatmap(df.isnull())

In [None]:
df.isnull().sum()

In [None]:
#with missing
fig, axs = plt.subplots(2, 3, figsize=(7, 7))
sns.kdeplot(x=df['Net migration'], color='green', ax=axs[1, 0]).set_xlabel('Net migration')
sns.kdeplot(x=df['Infant mortality (per 1000 births)'], color='silver', ax=axs[1, 1]).set_xlabel('Infant mortality (per 1000 births)')
sns.kdeplot(x=df['Phones (per 1000)'], color='gold', ax=axs[0, 0]).set_xlabel('Phones (per 1000)')
sns.kdeplot(x=df['Birthrate'], color='blue', ax=axs[0, 1]).set_xlabel('Birthrate')
sns.kdeplot(x=df['Deathrate'], color='teal', ax=axs[0, 2]).set_xlabel('Deathrate')

In [None]:
data =df.copy()
data =data.drop(['Country', 'Region'], axis=1)

In [None]:
imp = IterativeImputer()
imputed =imp.fit_transform(data)
df_imputed = pd.DataFrame(imputed, columns=data.columns)
df_imputed

In [None]:
df[['Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service']] = df_imputed

In [None]:
#without missing
fig, axs = plt.subplots(2, 3, figsize=(7, 7))
# Same panel layout as the "with missing" figure so the two can be compared directly
sns.kdeplot(x=df['Net migration'], color='green', ax=axs[1, 0]).set_xlabel('Net migration')
sns.kdeplot(x=df['Infant mortality (per 1000 births)'], color='silver', ax=axs[1, 1]).set_xlabel('Infant mortality (per 1000 births)')
sns.kdeplot(x=df['Phones (per 1000)'], color='gold', ax=axs[0, 0]).set_xlabel('Phones (per 1000)')
sns.kdeplot(x=df['Birthrate'], color='blue', ax=axs[0, 1]).set_xlabel('Birthrate')
sns.kdeplot(x=df['Deathrate'], color='teal', ax=axs[0, 2]).set_xlabel('Deathrate')

In [None]:
sns.heatmap(df.isnull())

In [None]:
df.info()

In [None]:
df.Climate =df.Climate.astype(int)

In [None]:
df.info()

***
# Analysis <a class="anchor" id="analysis"></a> 

In [None]:
corr = df.drop(['Country', 'Region'], axis = 1).corr()
corr

In [None]:
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot = True, cmap = 'RdBu')

**From the correlation heatmap we find that the feature most strongly correlated with GDP per capita is the number of phones per 1000 people**
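
Reading the strongest correlate off a large heatmap is error-prone, so it can help to rank the features numerically. The sketch below does this on a small hypothetical frame (the values are invented for illustration); on the real data the same expression could be applied to `corr`. Note that a high correlation shows association, not that the feature causes GDP.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of df.
toy = pd.DataFrame({
    "GDP ($ per capita)": [1000, 5000, 12000, 20000, 30000],
    "Phones (per 1000)":  [20, 150, 300, 480, 650],
    "Birthrate":          [40, 30, 18, 12, 10],
    "Deathrate":          [9, 12, 7, 10, 8],
})

target = "GDP ($ per capita)"

# Pearson correlation of every feature with the target,
# ordered by absolute strength so negative correlates are not buried.
ranking = (
    toy.corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranking)
```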

***
## The relation between GDP and number of phones <a class='anchor' id='phones'></a>

In [None]:
sns.scatterplot(data = df, x = 'Phones (per 1000)', y = 'GDP ($ per capita)', color = 'blue')

**We can see a roughly linear, positive relation between the number of phones and GDP per capita, but the real question is: which one drives the other?**<br>
The likely answer is that the number of phones depends on GDP: as GDP increases, people's economic situation improves, which increases their consumption, and they can afford more devices.

## GDP in different countries <a class='anchor' id='diffgdp'></a>

We need to convert the country names to three-letter ISO codes for the Plotly Express maps.

In [None]:
# Create a function to convert country names to alpha-3 codes
def country_to_alpha3(country_name):
    try:
        alpha3 = coco.convert(names=country_name, to='ISO3' ,not_found=np.nan)
        if alpha3:
            return alpha3
        else:
            return None
    except Exception as e:
        return None

# Apply the function to the 'Country' column to create a new 'Country Code' column
df['Country Code'] = df['Country'].apply(country_to_alpha3)

# Display the DataFrame with 'Country' and 'Country Code' columns
print(df[['Country', 'Country Code']])

In [None]:
df['Country Code'].isnull().sum()

In [None]:
df[df['Country Code'].isnull()==True]

In [None]:
df.loc[220,'Country Code' ] ='VIR'
df.loc[147,'Country Code'] ='ANT'

In [None]:
df['Country Code'].isnull().sum()

In [None]:
fig = px.choropleth(df, locations="Country Code",
                    color='GDP ($ per capita)',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)

fig.show()

***
## Is GDP affected by Regions? <a class='anchor' id='region'></a>

In [None]:
plt.figure(figsize=(20, 5))
plt.title('GDP per capita per region', fontsize=15)
plt.xticks(fontsize=10, rotation =0)
plt.ylabel('GDP ($ per capita)', fontsize=15)
plt.yticks(fontsize=10)
plt.xlabel('Region', fontsize=15)
sns.barplot(data = df, x = 'Region', y = 'GDP ($ per capita)')

In [None]:
# Total area of each region, largest first (select the column before summing)
Regions_by_Area = df.groupby('Region')['Area (sq. mi.)'].sum().sort_values(ascending = False)
Regions_by_Area

In [None]:
plt.figure(figsize=(20, 5))
# This chart shows the area of each region, so label it accordingly
ax = Regions_by_Area.plot(kind = 'barh', color = ['#809BCE', '#95B8D1', '#B8E0D2', '#D6EADF', '#EAC4D5', '#ECC9D9', '#EECEDC', '#F0D2DF', '#F1D6E2', '#F2DAE5', '#F4DAE8'])
ax.set_title('Total area per region', fontsize=15)
ax.set_xlabel('Area (sq. mi.)', fontsize=15)
ax.set_ylabel('Region', fontsize=15)
ax.tick_params(axis='y', labelsize=10)

**The answer to the question is yes: although Northern America covers less area than Sub-Saharan Africa, Northern American countries have a higher GDP per capita.**

***
## The Effect of Mortality (BirthRate, InfantMortality, DeathRate) on GDP <a class='anchor' id='mortality'></a>

In [None]:
df[['Birthrate', 'Deathrate', 'Infant mortality (per 1000 births)', 'GDP ($ per capita)']]

In [None]:
sns.scatterplot(data = df, x = 'Infant mortality (per 1000 births)', y = 'GDP ($ per capita)', color = 'red', hue = 'Deathrate')

In [None]:
df2=df.copy()

In [None]:
df2.groupby('Country').sum()['GDP ($ per capita)'].sort_values(ascending =False)

**From the previous scatter plot, we can conclude that GDP per capita is negatively associated with the death rate and infant mortality**

***
## Population Increase Rate <a class='anchor' id='PopInc'></a>

In [None]:
df2= df.copy()

In [None]:
df2['population increase rate'] = df2['Birthrate'] - df2['Deathrate'] 

In [None]:
pop_decrease = df2[df2['population increase rate']<=0]

In [None]:
sns.kdeplot(x= pop_decrease['GDP ($ per capita)'], color='#3f3f3f', fill=True, alpha=0.1, label ='population decrease')
sns.kdeplot(x= df2[df2['population increase rate']>0]['GDP ($ per capita)'], color ='#193eb0', fill=True, alpha=0.1, label ='population increase')
plt.legend()

**Countries with a decreasing population tend to have a higher GDP per capita on average**
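
A density plot shows the shape of the two groups, and a short summary table can back the comparison with numbers. The sketch below uses invented rates and GDP values, and treats the population increase rate as births minus deaths, as the notebook does (migration is ignored).

```python
import numpy as np
import pandas as pd

# Hypothetical rates per 1000 people and GDP per capita in dollars.
toy = pd.DataFrame({
    "Birthrate": [9.0, 10.0, 25.0, 35.0],
    "Deathrate": [10.0, 11.0, 8.0, 12.0],
    "GDP ($ per capita)": [28000, 24000, 6000, 1500],
})

# Natural increase: births minus deaths.
toy["population increase rate"] = toy["Birthrate"] - toy["Deathrate"]
toy["pop_decrease"] = np.where(toy["population increase rate"] <= 0, "yes", "no")

# Compare typical GDP per capita between the two groups.
summary = toy.groupby("pop_decrease")["GDP ($ per capita)"].agg(["count", "mean", "median"])
print(summary)
```

Reporting the group sizes alongside the averages matters here, because the group with a decreasing population is usually much smaller.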

### Does population decrease have any relation to the healthcare system?

In [None]:
df2['pop_decrease'] = df2['population increase rate']
df2['pop_decrease']= np.where(df2['pop_decrease'] <= 0, 'yes', 'no')

In [None]:
df2['pop_decrease'].value_counts()

In [None]:
top =df2.groupby('Country').sum()['GDP ($ per capita)'].sort_values(ascending =False).head(50).index
worst =df2.groupby('Country').sum()['GDP ($ per capita)'].sort_values(ascending =False).tail(50).index
df2['Top_Worst'] = df['Country'].apply(lambda x: 'Top' if x in top else ('Worst' if x in worst else 'Other'))

In [None]:
fig = px.scatter(df2, x='population increase rate', y='Infant mortality (per 1000 births)', color='Top_Worst' , title="Scatter Plot")
fig.update_layout(
    title="Population increase rate vs Infant mortality",
    height=400
)
fig.show()

**Most population decreases are not caused by flaws in the healthcare system; they stem from social norms and lower birth rates**

***
## What contributes more to the GDP? Agriculture or Industry or Services? <a class='anchor' id='sector'></a>

In [None]:
# First let's plot the top 10 countries in GDP and analyze them
group_by_Etype = df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[['Agriculture', 'Industry', 'Service']]
# group_by_Etype.concat()
group_by_Etype    # Etype : Economical type

In [None]:
top10 = group_by_Etype.head(10)
top10

In [None]:
group_by_Etype.index

In [None]:
Egy_E= df[df['Country Code'] =='EGY'].groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[['Agriculture', 'Industry', 'Service']]

In [None]:
Egy_E

In [None]:
Egy_E.T.plot.pie(autopct = "%1.1f%%", subplots=True, colors= ['#98D8AA','#F3E99F','#FF6D60'])

In [None]:
group_by_Etype = pd.concat([top10,Egy_E ])

In [None]:
plt.figure(figsize=(10, 6))
ax = group_by_Etype.plot(kind='bar', stacked=True, colormap='Set3')

# Customize the plot
plt.xlabel('Country')
plt.ylabel('sector ratios')
plt.title('Stacked Bar Plot of Top 10 Countries by GDP and Economic Type')

# Add a legend
plt.legend(title='Economic Type', loc='upper right')

# Show the plot
plt.show()

In [None]:
dfc = df.copy()
dfc[['Agriculture', 'Industry', 'Service']] = dfc[['Agriculture', 'Industry', 'Service']]*100

In [None]:
fig = px.choropleth(dfc, locations="Country Code",
                    color='Agriculture',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Aggrnyl_r)
fig.show()

In [None]:
fig = px.choropleth(dfc, locations="Country Code",
                    color='Industry',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Burg)
fig.show()

In [None]:
fig = px.choropleth(dfc, locations="Country Code",
                    color='Service',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Agsunset)
fig.show()

***
## Does the geography of a country affect its GDP? <a class='anchor' id='geo'></a>

In [None]:
df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['GDP ($ per capita)','Area (sq. mi.)','Coastline (coast/area ratio)','Arable (%)','Crops (%)','Other (%)']]

In [None]:
group_by_Gtype = df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Arable (%)','Crops (%)','Other (%)']]
# Gtype : Geography Type

In [None]:
Egy_G= df[df['Country Code'] =='EGY'].groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[['Arable (%)','Crops (%)','Other (%)']]

In [None]:
Egy_G

In [None]:
Egy_G.T.plot.pie(autopct = "%1.1f%%", subplots=True, colors= ['#C7E8CA','#539165','#F7C04A'])

In [None]:
group_by_Gtype = pd.concat([group_by_Gtype,Egy_G])
group_by_Gtype

In [None]:
plt.figure(figsize=(10, 6))
ax = group_by_Gtype.plot(kind='bar', stacked=True, colormap='Set2')

# Customize the plot
plt.xlabel('Country')
plt.ylabel('Land share (%)')
plt.title('Stacked Bar Plot of Top 10 Countries by GDP and Geographic Type')

# Add a legend
plt.legend(title='Geographic Type', loc='upper right')

# Show the plot
plt.show()

In [None]:
# Plot directly; assigning the result would overwrite group_by_Gtype with a matplotlib Axes
df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Coastline (coast/area ratio)']].plot(kind = 'bar')

**From the previous graphs we can observe that the top 10 countries by GDP per capita have similar geographies, which implies that the geography of a country has an effect on its GDP per capita**

### Are services only tourism?

In [None]:
sns.heatmap(df[['Coastline (coast/area ratio)', 'Service']].corr(), annot = True, cmap = 'cividis')

In [None]:
fig = px.scatter(df, x='Coastline (coast/area ratio)', y='Service', title="Scatter Plot")
fig.update_layout(
    title="Coastline ratios vs Service percent",
    height=400
)
fig.show()

**Judging by the coastline ratio, the answer to the question is no: services are not only tourism, because there is no clear relation between the service share and the coastline ratio. Services may therefore include income from tourism, imports, exports, and foreign investment.** <br>
**According to *[investopedia](https://www.investopedia.com/terms/s/service-sector.asp)*, the service sector comprises various service industries including warehousing and transportation services; information services; securities and other investment services; professional services; waste management; health care and social assistance; and arts, entertainment, and recreation.**


***
## Does population benefit or harm the GDP? <a class='anchor' id='pop'></a>

In [None]:
df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Population','Area (sq. mi.)','Pop. Density (per sq. mi.)']]

In [None]:
group_by_Ptype = df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Population']]
# Ptype : Population Type

In [None]:
Egy_P= df[df['Country Code'] =='EGY'].groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[['Population']]

In [None]:
group_by_Ptype = pd.concat([group_by_Ptype,Egy_P])
group_by_Ptype

In [None]:
plt.figure(figsize=(10, 6))
group_by_Ptype.plot(kind = 'bar', color = '#25A18E')

**We can see that countries with a low population tend to have a high GDP per capita (treating the USA as an outlier). This is expected, because for a fixed total GDP, GDP per capita falls as the population rises: GDP per capita = GDP / population.**
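
The identity behind this observation is simple arithmetic. The sketch below uses hypothetical figures, chosen only to illustrate it, and holds total GDP fixed while the population changes; across real countries both quantities vary, so the inverse relation is only a tendency.

```python
# Hypothetical figures, chosen only to illustrate the identity.
population = 80_000_000
gdp_per_capita = 4_000  # dollars per person

# Total GDP = population * GDP per capita
total_gdp = population * gdp_per_capita
print(total_gdp)  # 320000000000

# If total GDP stays fixed while the population grows by 25%,
# GDP per capita falls by 20%.
new_per_capita = total_gdp / (population * 1.25)
print(new_per_capita)  # 3200.0
```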

---
## Total GDP analysis <a class='anchor' id='total'></a>

In [None]:
df_GDP = df.copy()
df_GDP['Total GDP'] = df_GDP['Population']*df_GDP['GDP ($ per capita)']
df_GDP[['Population','GDP ($ per capita)', 'Total GDP']]

In [None]:
df_GDP.groupby('Country').sum().sort_values('Total GDP', ascending = False).head(10)[['Population','Area (sq. mi.)','Pop. Density (per sq. mi.)', 'Arable (%)','Crops (%)']]

In [None]:
fig = px.choropleth(df_GDP, locations="Country Code",
                    color='Total GDP',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Emrld)
fig.show()

In [None]:
fig = px.choropleth(df_GDP, locations="Country Code",
                    color='Population',
                    hover_name='Country', # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Sunsetdark)
fig.show()

**From the previous maps and tables we can observe that the countries with the largest total GDP also have large populations**

***
## The Effect of Education on GDP <a class='anchor' id='edu'></a>

In [None]:
df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Population','Literacy (%)']]

In [None]:
group_by_Lit = df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Literacy (%)']]
# Lit :  Literacy

In [None]:
Egy_L= df[df['Country Code'] =='EGY'].groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[['Literacy (%)']]

In [None]:
group_by_Lit = pd.concat([group_by_Lit,Egy_L])
group_by_Lit

In [None]:
plt.figure(figsize=(16, 6))
# group_by_Lit.plot(kind = 'barh', color = '#3587A4')
plt.xticks(range(len(group_by_Lit.index)), group_by_Lit.index.to_list())
(markers, stemlines, baseline) = plt.stem(group_by_Lit['Literacy (%)'])
plt.setp(stemlines, linestyle="-", color="#BD93D8", linewidth=1.5 )
plt.setp(markers, markersize=10, color="#9799CA", markeredgewidth=1.5)

**The answer to the question is yes: education is strongly associated with GDP per capita, and the more educated a country is, the higher its GDP per capita tends to be.**

***
## Net Migration Effect <a id='mig'></a>

In [None]:
df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Population', 'Net migration']]

In [None]:
group_by_Mig = df.groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False).head(10)[['Net migration']]
# Mig :  Migration

In [None]:
Egy_M= df[df['Country Code'] =='EGY'].groupby('Country').sum().sort_values('GDP ($ per capita)', ascending = False)[[ 'Net migration']]

In [None]:
group_by_Mig = pd.concat([group_by_Mig,Egy_M])
group_by_Mig

In [None]:
plt.figure(figsize=(16, 6))
plt.xticks(range(len(group_by_Mig.index)), group_by_Mig.index.to_list())
(markers, stemlines, baseline) = plt.stem(group_by_Mig[ 'Net migration'])
plt.setp(stemlines, linestyle="-", color="#42113C", linewidth=2 )
plt.setp(markers, markersize=10, color="#0A81D1", markeredgewidth=1)

**We can observe that the countries with the highest GDP per capita have positive net migration rates, which is logical because it means that these countries are attractive to investors and talented people**

***
# Time Series <a id='time'></a>

In [None]:
# keep only the countries common to both datasets
common_country_codes = set(time["Country Code"]).intersection(df["Country Code"])
time = time[time["Country Code"].isin(common_country_codes)]
time.shape

In [None]:
# create a DataFrame that holds only Egypt, which is the target series
EGY_time =time[time['Country Code']=='EGY'].drop('Country Name',axis=1).set_index('Country Code').iloc[:,5:].T

EGY_time.head()

In [None]:
#making the years the dataframe index
time_T =time.T
time_T.head(10)

In [None]:
# Drop columns 2 to 36 by label
time = time.drop(columns=time.columns[2:37])
time

In [None]:
time.columns

In [None]:
time.isnull().sum()

In [None]:
time[time['1995'].isnull()==True]

In [None]:
time.info()

## Missing values <a id='TimeMiss'></a>

In [None]:
time_TC=time_T.copy()

Since GDP does not change sharply from one year to the next, we can fill the gaps with `bfill` or `ffill`.
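
To make the effect of the two fills concrete, here is a small sketch on a hypothetical yearly series (the numbers are invented). It also shows linear interpolation as a possible alternative for interior gaps; that alternative is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly series with gaps at the start, middle and end.
gdp = pd.Series(
    [np.nan, 10.0, np.nan, 14.0, np.nan],
    index=["1960", "1961", "1962", "1963", "1964"],
)

# Backward fill copies the next observed value into each gap;
# forward fill then covers trailing gaps that have no later value.
filled = gdp.bfill().ffill()
print(filled.tolist())  # [10.0, 10.0, 14.0, 14.0, 14.0]

# Alternative: linear interpolation, restricted to gaps between observations.
interpolated = gdp.interpolate(limit_area="inside")
print(interpolated.tolist())
```

A column with no observed values at all stays empty after both fills, which is why such columns still have to be dropped afterwards.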

In [None]:
col_name =time_T.loc['Country Code'].to_dict()
code= time_TC.loc['Country Code']
country =time_TC.loc['Country Name']
# .bfill()/.ffill() replace the deprecated fillna(method=...) form
time_TC =time_TC.rename(columns= col_name).drop(time_TC.index[0]).drop('Country Code').bfill().ffill()

In [None]:
time_TC.head()

In [None]:
time_TC.isnull().sum().sort_values(ascending =False)

All values for RK, VGB, and GIB are null, so these columns are dropped.

In [None]:
time_TC =time_TC.dropna(axis=1)

---
## Time Analysis <a id='Tanalysis'></a>

In [None]:
time_TC['Global'] = time_TC.mean(axis=1)

In [None]:
top5 = df.groupby('Country Code').sum()[['GDP ($ per capita)']].sort_values('GDP ($ per capita)', ascending= False).head(5)

In [None]:
top5

In [None]:
# First, set the first row as the new column names
time_T.columns = time_T.loc['Country Code']

# Then, drop the 'Country Code' row
time_T = time_T.drop(['Country Code', 'Country Name'])

# Now, your columns are renamed based on the 'Country Code' row
time_T.head()

In [None]:
time_top5 =time_T.loc[:, time_T.columns.isin(top5.index)]

In [None]:
time_top5.head()

In [None]:
time_top5 = time_top5.loc[time_top5.index >= '1965']
time_top5

In [None]:
time_TC_filtered = time_TC[time_TC.index >= '1965']

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(15, 10))
plt.subplots_adjust(hspace=1)

axs[0].set_title('GDP of Egypt from the 60s', size=15)
axs[0].set_xlabel('Year', size=10)
axs[0].set_ylabel('GDP', size=10)
axs[0].plot(EGY_time.index, EGY_time, linestyle='dashed', marker='*', color='#004725')
# Rotate the tick labels after plotting, so they match the ticks that were actually drawn
axs[0].tick_params(axis='x', labelrotation=45)
axs[0].grid()

axs[1].set_title('Global GDP from the 60s', size=15)
axs[1].set_xlabel('Year', size=10)
axs[1].set_ylabel('GDP', size=10)
axs[1].plot(time_TC_filtered.index, time_TC_filtered['Global'], linestyle='dashed', marker='x')
axs[1].tick_params(axis='x', labelrotation=45)
axs[1].grid()

axs[2].set_title('Top 5 from the 60s', size=15)
axs[2].set_xlabel('Year', size=10)
axs[2].set_ylabel('GDP', size=10)
# axs[2].plot(time_top5.index, time_top5, color='skyblue')
for column in time_top5.columns:
    axs[2].plot(time_top5.index, time_top5[column], linestyle='dashed', marker='x', label =column)
axs[2].tick_params(axis='x', labelrotation=45)
axs[2].grid()
axs[2].legend()

### Model decomposition

In [None]:
#multiplicative model
date_index = pd.date_range(start=str(EGY_time.index.min()), periods=len(EGY_time), freq='Y')
EGY_time.index = date_index
multiplicative_decomposition = seasonal_decompose(EGY_time, model='multiplicative')

plt.rcParams.update({'figure.figsize': (10, 5)})
multiplicative_decomposition.plot()
plt.show()


In [None]:
# Additive Model
EGY_time.index = date_index
additive_decomposition = seasonal_decompose(EGY_time, model='additive')

plt.rcParams.update({'figure.figsize': (10, 5)})
additive_decomposition.plot()
plt.show()


**The data follows the additive model, which means that its components are independent of each other.
The data has no seasonality.**
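
For reference, the additive model writes a series as trend + seasonal + residual, while the multiplicative model writes it as trend × seasonal × residual. The sketch below builds both kinds of series from synthetic components (hypothetical, not Egypt's GDP, and without a residual term) to show the practical difference: in the additive case the swing around the trend stays constant, in the multiplicative case it grows with the level.

```python
import numpy as np

# Synthetic components (hypothetical, not Egypt's GDP).
t = np.arange(12)
trend = 100.0 + 5.0 * t
seasonal_add = np.tile([3.0, -3.0], 6)    # fixed-size swing
seasonal_mul = np.tile([1.03, 0.97], 6)   # swing proportional to the level

additive = trend + seasonal_add
multiplicative = trend * seasonal_mul

# In the additive series the swing around the trend is constant ...
print(np.abs(additive - trend))
# ... while in the multiplicative series it grows with the trend.
print(np.abs(multiplicative - trend).round(2))

# Taking logs turns a multiplicative series into an additive one.
assert np.allclose(np.log(multiplicative), np.log(trend) + np.log(seasonal_mul))
```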

### Checking the stationarity of the data

In [None]:
def test_stationarity(timeseries):
    
    #Determine rolling statistics
    movingAverage = timeseries.rolling(window=6).mean()
    movingSTD = timeseries.rolling(window=6).std()
    
    #Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(movingAverage, color='red', label='Rolling Mean')
    std = plt.plot(movingSTD, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey–Fuller test:
    print('Results of Dickey Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
    

In [None]:
test_stationarity(EGY_time)

**According to the rolling statistics and the Dickey-Fuller test, the data is not stationary (it behaves like a random walk), especially in its variance**
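
A random walk is the textbook case of a non-stationary series that becomes stationary after one round of differencing, which is what `d=1` in an ARIMA model does. The sketch below illustrates this on a synthetic walk (the seed and sizes are arbitrary choices); it is not a substitute for testing the differenced GDP series itself.

```python
import numpy as np
import pandas as pd

# Synthetic random walk: each value is the previous one plus a random step.
rng = np.random.default_rng(0)
steps = rng.normal(loc=0.0, scale=1.0, size=200)
walk = pd.Series(steps).cumsum()

# First differencing recovers the steps, which are stationary by construction.
diff = walk.diff().dropna()
assert np.allclose(diff.to_numpy(), steps[1:])

# Rolling means: compare how far the level moves in each series.
print(walk.rolling(window=50).mean().dropna().agg(["min", "max"]))
print(diff.rolling(window=50).mean().dropna().agg(["min", "max"]))
```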

### PACF and ACF

In [None]:
fig = plt.figure(figsize=  (8,8))
ax1 = fig.add_subplot(211)
fig = plot_acf(EGY_time , lags = 15 , ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(EGY_time, lags = 15 , ax=ax2)
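For readers who want to see what the ACF plot measures, the following is a minimal NumPy sketch of the sample autocorrelation. The helper `acf` and the series `trending` are hypothetical names introduced here. A trending series gives autocorrelations that stay close to 1 and decay slowly, which is a typical sign of non-stationarity.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(nlags + 1)
    ])

# Illustrative trending series (not the Egypt data): autocorrelation decays slowly.
trending = np.arange(50, dtype=float)
acf_values = acf(trending, nlags=5)
print(np.round(acf_values, 3))
```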

### GDP Growth Rate

In [None]:
# calculate the change in GDP since prior year which is called Economic Growth
gdp_growth = EGY_time['EGY'].pct_change() * 100  # Multiply by 100 for percentage
gdp_growth_df = pd.DataFrame({'Economic Growth': gdp_growth})
gdp_growth_df
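As a quick check of what `pct_change` computes, the sketch below compares it with the manual growth formula, (current - previous) / previous * 100, on made-up values. The series `gdp` here is a toy example, not the Egypt data.

```python
import pandas as pd

# Made-up GDP values for illustration (not the Egypt series).
gdp = pd.Series([100.0, 110.0, 121.0, 108.9], index=[2000, 2001, 2002, 2003])

# Manual formula: (current - previous) / previous * 100
manual_growth = (gdp - gdp.shift(1)) / gdp.shift(1) * 100

# pandas shortcut used in the notebook
pct_growth = gdp.pct_change() * 100

print(pd.DataFrame({"manual": manual_growth, "pct_change": pct_growth}))
```

The first year has no previous value, so its growth rate is `NaN`.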


In [None]:
plt.title ("Economic growth of Egypt's gdp",size=15)
plt.xlabel('Year',size=10)
plt.ylabel('GDP Growth',size=10)
plt.xticks(gdp_growth.index)
gdp_growth.plot( color='#004725')
plt.grid()


# ARIMA Modelling <a id='arima'></a>

In [None]:
EGY_time.shape

In [None]:
y_train = EGY_time[:45]
y_test = EGY_time[45:]

In [None]:
model = pm.auto_arima(EGY_time, start_p=1, start_q=1,
                      max_p=30, max_q=30,
                      trace=True,
                      d=1,  max_d = 15,      # d is fixed at 1, so no stationarity test is run to choose it
                      seasonal=False,
                      suppress_warnings=True)

print(model.summary())

In [None]:
model = sm.tsa.arima.ARIMA(y_train, order=(1,1,3))
fitted = model.fit(method = 'innovations_mle')

# Forecast
fc= fitted.forecast(steps=20, alpha=0.05)  # 95% conf

In [None]:
arima_pred = fitted.predict(start=len(y_train), end=len(EGY_time) - 1)

# Set the date index for arima_pred using the date values from the original dataset
arima_pred.index = EGY_time.index[len(y_train):]
arima_pred
# Now arima_pred has the same date index as the original dataset

In [None]:
fc

In [None]:
# Forecast 20 steps ahead with the fitted ARIMA model
forecast = fitted.get_forecast(steps=20)

# Extract the forecasted values and associated confidence intervals
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()

# Plot the forecasted values and confidence intervals
plt.figure(figsize=(12, 6), dpi=100)
plt.plot(forecast_mean.index, forecast_mean.values, color='blue', label='Forecast')
plt.fill_between(forecast_ci.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='gray', alpha=0.2, label='95% Prediction Interval')
plt.plot(y_train, label='Training (Actual)', color='green', marker='o', markersize=3)
plt.plot(y_test.index, y_test.values, label='Test (Actual)', color='red', marker='o', markersize=3)

plt.xlabel('Time')
plt.ylabel('Value')
plt.grid(True)
plt.legend()
plt.title('ARIMA Forecast vs Actuals')
plt.show()

In [None]:
print('MAPE:',mean_absolute_percentage_error(y_test,arima_pred))
print('MAE:',mean_absolute_error(y_test,arima_pred))
print('RMSE:',np.sqrt(mean_squared_error(y_test,arima_pred)))
print('AIC:',fitted.aic)
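The error metrics are easy to mix up, so here is a small NumPy sketch of their definitions on toy values (not the model's output). Note that RMSE is the square root of the mean squared error and is in the same units as the data, and that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage.

```python
import numpy as np

# Toy values for illustration only.
y_true = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 180.0, 400.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))              # mean absolute error
mse = np.mean(errors ** 2)                 # mean squared error (squared units)
rmse = np.sqrt(mse)                        # root of MSE, same units as the data
mape = np.mean(np.abs(errors / y_true))    # fraction, as scikit-learn reports it

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  MAPE={mape:.4f}")
```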

# Rest of preprocessing

***
## Outliers <a id='out'></a>

In [None]:
df.shape

In [None]:
df.plot(kind='box', subplots=True,figsize = (10,10) , layout = (5,5))

In [None]:
data =df.copy()
cols = data.drop(['Country', 'Region', 'Country Code'], axis=1).columns

In [None]:
# Winsorize each numeric column at the 2.5th and 97.5th percentiles
for col in cols:
    lower, upper = data[col].quantile([0.025, 0.975]).values
    data[col] = data[col].clip(lower=lower, upper=upper)

In [None]:
data.plot(kind='box', subplots=True,figsize = (10,10) , layout = (5,5))

## Label encoding <a id='le'></a>

In [None]:
df =df.drop(['Country','Country Code'], axis =1) # not important for modelling

In [None]:
le =LabelEncoder()
df['Region'] = le.fit_transform(df['Region'])

## Train test split <a id='split'></a>

In [None]:
x=df.drop('GDP ($ per capita)', axis=1)
y=df['GDP ($ per capita)']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25, random_state=42)

## Scaling <a id='scale'></a>

In [None]:
SC= StandardScaler()

In [None]:
x_scaled=SC.fit_transform(x)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled,y,test_size=0.25, random_state=42)
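Standard scaling subtracts each column's mean and divides by its standard deviation. A commonly recommended variant, which differs from fitting the scaler on all rows of `x`, is to estimate those statistics from the training rows only so the test rows do not influence preprocessing. The NumPy sketch below illustrates that variant on synthetic data. All names in it are assumptions for illustration.

```python
import numpy as np

# Synthetic feature matrix for illustration (not the notebook's data).
rng = np.random.default_rng(42)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
train, test = features[:75], features[75:]

# Learn the scaling statistics from the training rows only...
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

# ...then apply the same transformation to both splits.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

With scikit-learn the same pattern is `fit_transform` on the training split followed by `transform` on the test split.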

## VIF <a id='vif'></a>

In [None]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = df.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(df.values, i)
                          for i in range(len(df.columns))]
  
vif_data

In [None]:
data =df.drop(['Agriculture', 'Service', 'Other (%)', 'Arable (%)'], axis =1)

In [None]:
data['population increase']= data['Birthrate']-data['Deathrate']
data =data.drop(['Birthrate','Deathrate'],axis=1)

In [None]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = data.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(data.values, i)
                          for i in range(len(data.columns))]
  
vif_data

In [None]:
data = data.drop('Climate', axis=1)

In [None]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = data.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(data.values, i)
                          for i in range(len(data.columns))]
  
vif_data

In [None]:
data = data.drop('Literacy (%)', axis=1)

In [None]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = data.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(data.values, i)
                          for i in range(len(data.columns))]
  
vif_data

In [None]:
sns.heatmap(data.corr(), annot=True)

In [None]:
x_mc=data.drop('GDP ($ per capita)', axis=1)
y_mc=data['GDP ($ per capita)']
x_train_mc, x_test_mc, y_train_mc, y_test_mc = train_test_split(x_mc,y_mc,test_size=0.25, random_state=42)

**Dropped the columns with high collinearity**
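For intuition about the VIF values, the sketch below computes them by hand as 1 / (1 - R²), where R² comes from regressing each column on the others with an intercept. The helper `vif` and the synthetic columns are assumptions for illustration. statsmodels' `variance_inflation_factor` regresses on the columns exactly as given, so its numbers can differ when no constant column is included.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1 - residuals.var() / target.var()
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)

# Synthetic data: column 1 nearly duplicates column 0, column 2 is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two near-duplicate columns get large VIFs while the unrelated column stays close to 1, which is the pattern used to decide which columns to drop.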

#### Trying PCA for dimensionality reduction

In [None]:
pca = PCA(svd_solver='randomized', random_state=42)

In [None]:
pca.fit(x_scaled)

In [None]:
%matplotlib inline
fig = plt.figure(figsize = (8,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.savefig('pca_no')
plt.show() 

**To address the multicollinearity problem, we first dropped the collinear columns, then tried PCA as a second way to reduce the dimensions of the feature set. Both approaches led to the same conclusion.**
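One way to read a cumulative explained variance curve is to pick the smallest number of components that reaches a chosen threshold. The sketch below does this with NumPy on synthetic data. The 0.95 threshold and all variable names are assumptions, not values from the notebook.

```python
import numpy as np

# Synthetic standardized data: two nearly collinear columns plus one independent column.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
X = np.column_stack([a, a + 0.05 * rng.normal(size=300), rng.normal(size=300)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Explained variance ratios from the singular values of the centered data.
singular_values = np.linalg.svd(X, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off, not a value from the notebook
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(np.round(cumulative, 4), n_components)
```

Because two of the three columns carry almost the same information, two components are enough here, mirroring the effect of dropping a collinear column.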

----
# Regression Modelling <a id='reg'></a>

## Linear regression <a id ='lin'></a>

### With multicollinearity

In [None]:
lr=LinearRegression()
lr.fit(x_train, y_train)

In [None]:
lr.score(x_train, y_train)

In [None]:
lr.score(x_test, y_test)

In [None]:
y_pred = lr.predict(x_test)

In [None]:
df_predict1 =pd.DataFrame({"Y_test" : y_test.values , "Y_predict": y_pred})
df_predict1.head()

In [None]:
print('R2:',r2_score(y_test, y_pred))
print('MAPE:',mean_absolute_percentage_error(y_test, y_pred))
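Both `lr.score` and `r2_score` report the coefficient of determination. As a reminder of what that number means, here is a minimal sketch of the formula R² = 1 - SS_res / SS_tot on toy values (not the model's predictions).

```python
import numpy as np

# Toy values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"R2 = {r2:.3f}")
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible on test data.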

In [None]:
plt.figure(figsize= (10,5))
plt.plot(df_predict1)
plt.legend(["Actual" , " Predicted"])

### Without multicollinearity

In [None]:
lr.fit(x_train_mc, y_train_mc)

In [None]:
lr.score(x_train_mc, y_train_mc)

In [None]:
lr.score(x_test_mc, y_test_mc)

In [None]:
y_pred_mc = lr.predict(x_test_mc)
df_predict =pd.DataFrame({"Y_test" : y_test_mc.values , "Y_predict": y_pred_mc})
df_predict.head()

In [None]:
print('R2:',r2_score(y_test_mc, y_pred_mc))
print('MAPE:',mean_absolute_percentage_error(y_test_mc, y_pred_mc))

In [None]:
plt.figure(figsize= (12,6))
plt.plot(df_predict)
plt.legend(["Actual" , " Predicted"])

## Ridge Regression <a id='ridge'></a>

### With Multicollinearity

In [None]:
paramsRidge = {'alpha':[0.01, 0.1, 1,10,100], 'solver' : ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}

ridgeReg = GridSearchCV(Ridge(),paramsRidge, cv = 10)
ridgeReg.fit(X = x_train,y= y_train)
Rmodel = ridgeReg.best_estimator_


In [None]:
print(ridgeReg.best_score_, ridgeReg.best_params_)
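To show what the `alpha` parameter does, the sketch below uses the closed-form ridge solution without an intercept on synthetic, nearly collinear features. The helper `ridge_coefficients` and the data are assumptions for illustration, and scikit-learn's `Ridge` additionally fits an intercept by default. Larger `alpha` values shrink the coefficient vector, which is how ridge stabilises estimates under multicollinearity.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, nearly collinear features (illustrative only).
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.01 * rng.normal(size=100)])
y = 3 * a + rng.normal(size=100)

alphas = [0.01, 0.1, 1, 10, 100]  # same grid as paramsRidge
norms = [np.linalg.norm(ridge_coefficients(X, y, alpha)) for alpha in alphas]
for alpha, norm in zip(alphas, norms):
    print(f"alpha={alpha:<6} coefficient norm={norm:.3f}")
```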

In [None]:
# Reuse the best estimator found by the grid search (already refit on x_train)
rid = ridgeReg.best_estimator_

In [None]:
rid.score(x_train, y_train)

In [None]:
rid.score(x_test, y_test)

In [None]:
y_pred_rid = rid.predict(x_test)
pred_rid = pd.DataFrame({"Y_test" : y_test.values , "Y_predict": y_pred_rid})
pred_rid.head(10)

In [None]:
print('R2:',r2_score(y_test,y_pred_rid))
print('MAPE:',mean_absolute_percentage_error(y_test,y_pred_rid))

In [None]:
plt.figure(figsize= (12,6))
plt.plot(pred_rid)
plt.legend(["Actual" , " Predicted"])

### Without multicollinearity

In [None]:
paramsRidge = {'alpha':[0.01, 0.1, 1,10,100], 'solver' : ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}

ridgeReg2 = GridSearchCV(Ridge(),paramsRidge, cv = 10)
ridgeReg2.fit(X = x_train_mc,y= y_train_mc)
Rmodel = ridgeReg2.best_estimator_

In [None]:
print(ridgeReg2.best_score_, ridgeReg2.best_params_)

In [None]:
ridge = Ridge(alpha=100, solver= 'auto')
rid2=ridge.fit(x_train_mc, y_train_mc)

In [None]:
rid2.score(x_train_mc, y_train_mc)

In [None]:
rid2.score(x_test_mc, y_test_mc)

In [None]:
y_pred_rid2 = rid2.predict(x_test_mc)
pred_rid2 = pd.DataFrame({"Y_test" : y_test_mc.values , "Y_predict": y_pred_rid2})
pred_rid2.head(10)

In [None]:
print('R2:',r2_score(y_test_mc,y_pred_rid2))
print('MAPE:',mean_absolute_percentage_error(y_test_mc,y_pred_rid2))

## Lasso Regression <a id='lasso'></a>

In [None]:
clf = Lasso(alpha=7, max_iter=8000)

In [None]:
clf =clf.fit(x_train,y_train)

In [None]:
clf.score(x_train,y_train)

In [None]:
clf.score(x_test,y_test)

In [None]:
y_pred_la = clf.predict(x_test)
pred_la = pd.DataFrame({"Y_test" : y_test.values , "Y_predict": y_pred_la})
pred_la.head(10)

In [None]:
print('R2:',r2_score(y_test,y_pred_la))
print('MAPE:',mean_absolute_percentage_error(y_test,y_pred_la))

### Without multicollinearity

In [None]:
CLF2 = Lasso(alpha=7, max_iter=8000)

In [None]:
clf2 =CLF2.fit(x_train_mc,y_train_mc)

In [None]:
clf2.score(x_train_mc,y_train_mc)

In [None]:
clf2.score(x_test_mc,y_test_mc)

In [None]:
y_pred_la2 = clf2.predict(x_test_mc)
pred_la2 = pd.DataFrame({"Y_test" : y_test_mc.values , "Y_predict": y_pred_la2})
pred_la2.head(10)

In [None]:
print('R2:',r2_score(y_test_mc,y_pred_la2))
print('MAPE:',mean_absolute_percentage_error(y_test_mc,y_pred_la2))

##  KNN regressor <a id='knn'></a>

### With multicollinearity

In [None]:
pram_knn = {"n_neighbors": [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,50,100,150]}
knn = KNeighborsRegressor()
grid_knn = GridSearchCV(estimator= knn , param_grid= pram_knn , cv = 10 )
knn_grid_result = grid_knn.fit(x_train, y_train)

knn_grid_result

In [None]:
print ("Best: %f using %s" %(knn_grid_result.best_score_ , knn_grid_result.best_params_))

In [None]:
model_knn = knn_grid_result.best_estimator_
model_knn

In [None]:
print(model_knn.score(x_train,y_train))
model_knn.score(x_test,y_test)

In [None]:
y_pred_knn = model_knn.predict(x_test)
pred_knn= pd.DataFrame({"Y_test" : y_test.values , "Y_predict": y_pred_knn})
pred_knn.head(10)

In [None]:
print('R2:',r2_score(y_test,y_pred_knn))
print('MAPE:',mean_absolute_percentage_error(y_test,y_pred_knn))
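As a reminder of how a KNN regressor predicts, the sketch below averages the targets of the `k` nearest training points using Euclidean distance, which matches the default uniform weighting of `KNeighborsRegressor`. The helper `knn_predict` and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Predict by averaging the targets of the k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy one-feature data for illustration only.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

prediction = knn_predict(X_toy, y_toy, query=np.array([1.2]), k=3)
print(prediction)
```

Because the method depends on distances, features on larger scales dominate the neighbour search unless the inputs are scaled, and a larger `k` pulls in distant points such as the one at 10.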

### Without multicollinearity

In [None]:
pram_knn = {"n_neighbors": [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,50]}
knn = KNeighborsRegressor()
grid_knn = GridSearchCV(estimator= knn , param_grid= pram_knn , cv = 10 )
knn_grid_result = grid_knn.fit(x_train_mc, y_train_mc)

knn_grid_result

In [None]:
print ("Best: %f using %s" %(knn_grid_result.best_score_ , knn_grid_result.best_params_))

In [None]:
model_knn = knn_grid_result.best_estimator_
model_knn

In [None]:
print(model_knn.score(x_train_mc,y_train_mc))
model_knn.score(x_test_mc,y_test_mc)

In [None]:
y_pred_knn = model_knn.predict(x_test_mc)
pred_knn= pd.DataFrame({"Y_test" : y_test_mc.values , "Y_predict": y_pred_knn})
pred_knn.head(10)

In [None]:
print('R2:',r2_score(y_test_mc,y_pred_knn))
print('MAPE:',mean_absolute_percentage_error(y_test_mc,y_pred_knn))

# Auto ML <a id='auto'></a>

## ML Jar <a id='jar'></a>

In [None]:
automl = AutoML(algorithms=["Linear", 'Baseline', 'Xgboost', 'CatBoost'],
                total_time_limit=5*60)
automl.fit(x_train_mc, y_train_mc)

In [None]:
print(automl.score(x_train_mc, y_train_mc))
automl.score(x_test_mc, y_test_mc)

In [None]:
auto_pred = automl.predict(x_test_mc)

In [None]:
print('R2:',r2_score(y_test_mc,auto_pred))
print('MAPE:',mean_absolute_percentage_error(y_test_mc,auto_pred))

## PyCaret <a id='caret'></a>

In [None]:
clf = setup(data, target='GDP ($ per capita)', session_id = 123)

In [None]:
best_models = compare_models()

In [None]:
best_model = create_model("et")
tuned_model_mc = tune_model(best_model)

In [None]:
predictions = predict_model(tuned_model_mc, data=x_test_mc)
predictions

In [None]:
print(tuned_model_mc.score(x_train_mc, y_train_mc))
tuned_model_mc.score(x_test_mc, y_test_mc)

In [None]:
et_pred =pd.DataFrame({"Y_test" : y_test_mc.values , "Y_predict": predictions['prediction_label']})
et_pred.head()

In [None]:
plt.figure(figsize= (12,6))
plt.plot(et_pred)
plt.legend(["Actual" , " Predicted"])

In [None]:
# tuned_model is assumed to be the tuned model trained on the full feature set x
feat_weight = tuned_model.feature_importances_
feat_weight


In [None]:
Features =pd.DataFrame({"Features" : x.columns , "Weight": feat_weight}, )
cell_hover = {
    "selector": "td:hover",
    "props": [("background-color", "#FFFFE0")]
}
index_names = {
    "selector": ".index_name",
    "props": "font-style: italic; color: darkgrey; font-weight:normal;"
}
headers = {
    "selector": "th:not(.index_name)",
    "props": "background-color: #193EB0; color: white;"
}
properties = {"border": "1px solid black", "width": "65px", "text-align": "center"}
Features = Features.style.background_gradient(cmap="BuPu").format(precision=2).set_table_styles([cell_hover, index_names, headers]).set_properties(**properties)
Features

In [None]:
feat_weight_mc = tuned_model_mc.feature_importances_
feat_weight_mc

In [None]:
Features_mc =pd.DataFrame({"Features" : x_mc.columns , "Weight": feat_weight_mc})
cell_hover = {
    "selector": "td:hover",
    "props": [("background-color", "#FFFFE0")]
}
index_names = {
    "selector": ".index_name",
    "props": "font-style: italic; color: darkgrey; font-weight:normal;"
}
headers = {
    "selector": "th:not(.index_name)",
    "props": "background-color: #193EB0; color: white;"
}
properties = {"border": "1px solid black", "width": "65px", "text-align": "center"}
Features_mc = Features_mc.style.background_gradient(cmap="BuPu").format(precision=2).set_table_styles([cell_hover, index_names, headers]).set_properties(**properties)
Features_mc

# Conclusions <a id='conc'></a>

- <span style = 'font-size:20px;'> Sectors: </span> 
>- <span style = 'font-size:18px;'> Focus on the service sector </span>
>- <span style = 'font-size:18px;'> Reclaim more land to improve agriculture and increase arable land </span>
- <span style = 'font-size:20px;'> Phone ownership is an indication of the wealth of people in the country </span>
- <span style = 'font-size:20px;'> Increase the net migration rate </span>
- <span style = 'font-size:20px;'> Attract international investments: </span>
> - <span style = 'font-size:18px;'> Improve safety: In 2011, Egypt ranked 121. In 2021, Egypt ranked 65. </span>
- <span style = 'font-size:20px;'> Population growth rate: </span>
> - <span style = 'font-size:18px;'> Spread awareness to decrease it </span>
- <span style = 'font-size:20px;'> Literacy: </span>
> - <span style = 'font-size:18px;'> Attract international students </span>
- <span style = 'font-size:20px;'> Why international students: </span>
> - <span style = 'font-size:18px;'> Helps improve education </span>
> - <span style = 'font-size:18px;'> Increases the migration rate </span>


## Thanks For Reading
## Made By Team 209: Tasbih Othman & Ibrahim Hossam 