# CS210 Project - Comparative Education Index in Turkey, Before and After the Earthquake

# Team Information
## Serhan YILMAZ, Bilgehan Bilgin, Mustafa Harun Şendur, Beste Bayhan
## Team Name: Sanity Check


# Introduction

This project builds a machine learning (ML) model that predicts an education index for every province in Turkey. The model uses data columns provided by istatistik.meb.gov.tr: Student per Teacher, Student per School, Student per Classroom, and Budget per Student, together with the HDI Index provided by the United Nations Development Programme (UNDP). The features are gathered for four education levels: Kindergarten, Primary School, Secondary School, and High School. The data covers the years 2012 to 2020.

The education index is calculated by assigning a weight to each feature, as seen below. We defined the weights ourselves, after considering each feature. Once the pre-HDI index (`education_index_pre_hdi`) is computed, the final `education_index` is obtained by multiplying it by the HDI Index, to account for factors such as socioeconomic development.
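Written out, the computation in `calculate_education_index` corresponds to the expression below. The notation is introduced here for illustration only: $p$ indexes the rows of the DataFrame passed to the function, $j$ the weighted features, $w_j$ the weight of feature $j$, and $\mathrm{HDI}_p$ the raw `hdi_index` value of row $p$. The minimum and maximum are taken over the rows of that same DataFrame.

```latex
\[
\tilde{x}_{p,j} = \frac{x_{p,j} - \min_{q} x_{q,j}}{\max_{q} x_{q,j} - \min_{q} x_{q,j}},
\qquad
E_p = \mathrm{HDI}_p \sum_{j} w_j \, \tilde{x}_{p,j}
\]
```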

Using the data from 2012 to 2020, the feature values for 2021 to 2030 are predicted with ML regression models. The education index function is then applied to the predicted values to calculate the predicted education index for those years.

Finally, the known and predicted datasets are merged, and the combined dataset is visualized to give insight into the future development of Turkish cities and to indicate where investment and policy changes may be needed across Turkey.

# Loading Modules and Data

## Importing libraries

In [3]:
# from google.colab import drive
# import tabula
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Defining the Functions for Later Use

## Histogram Visualizer Function

In [4]:
def histogrammer(data, year):
  data.hist(figsize=(15, 15))
  plt.suptitle(f'Histograms for {year} Dataset')
  plt.show()

## Box Plotter Function

In [5]:
def boxplotter(data, year):
  data.plot(kind='box', subplots=True, layout=(4, 4), figsize=(15, 15), sharex=False, sharey=False)
  plt.suptitle(f'Box Plots for {year} Dataset')
  plt.show()

## Train and Predict Function

In [6]:
# Function to train a model and make predictions
def train_and_predict(X_train, y_train, X_future):
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor.predict(X_future)
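The imported `mean_squared_error` and `r2_score` can be used to check the fit before trusting a forecast. The sketch below repeats `train_and_predict` so that it runs on its own, and applies it to made-up data with a perfectly linear trend; the hold-out split at 2018 and the column name `value` are assumptions chosen for illustration, not part of the project data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_and_predict(X_train, y_train, X_future):
    # Same function as in the notebook, repeated so the sketch is self-contained
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor.predict(X_future)


# Made-up history: value = 2 * year - 4000 for the years 2012..2020
history = pd.DataFrame({'year': np.arange(2012, 2021)})
history['value'] = 2.0 * history['year'] - 4000.0

# Hold out the last two years to measure how well the trend extrapolates
train = history[history['year'] <= 2018]
test = history[history['year'] > 2018]
test_pred = train_and_predict(train[['year']], train['value'], test[['year']])
mse = mean_squared_error(test['value'], test_pred)
r2 = r2_score(test['value'], test_pred)

# Refit on the full history and forecast three future years
future = pd.DataFrame({'year': np.arange(2021, 2024)})
forecast = train_and_predict(history[['year']], history['value'], future[['year']])
```

On real data the hold-out error will not be zero; a large error on the held-out years would suggest that a straight line in `year` is too simple for that column.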

## Calculate Education Index Function

In [7]:
def calculate_education_index(data, weights):
    normalized_data = (data - data.min()) / (data.max() - data.min())
    education_index_pre_hdi = sum(normalized_data[column] * weight for column, weight in weights.items())
    education_index = education_index_pre_hdi * data['hdi_index']
    return education_index
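`calculate_education_index` normalizes every column of `data`, which may fail if the CSV files contain a text column such as a province name, and it divides by zero when a column is constant. A more defensive variant is sketched below under those assumptions. The name `calculate_education_index_safe`, the `hdi_column` parameter, and the toy columns are hypothetical and only serve the example.

```python
import pandas as pd


def calculate_education_index_safe(data, weights, hdi_column='hdi_index'):
    # Normalize only the weighted feature columns, so text columns are never touched
    features = data[list(weights)].astype(float)
    value_range = features.max() - features.min()
    # A constant column has zero range: turn that range into NaN, then map the result to 0.
    # Note that fillna also maps genuinely missing values to 0.
    normalized = ((features - features.min()) / value_range.where(value_range != 0)).fillna(0.0)
    pre_hdi = sum(normalized[column] * weight for column, weight in weights.items())
    return pre_hdi * data[hdi_column]


# Made-up data: one text column, one varying feature, one constant feature
toy = pd.DataFrame({
    'province': ['A', 'B', 'C'],
    'feat_a': [10.0, 20.0, 30.0],
    'feat_b': [5.0, 5.0, 5.0],
    'hdi_index': [0.8, 0.7, 0.9],
})
toy_weights = {'feat_a': 0.5, 'feat_b': 0.5}
toy_index = calculate_education_index_safe(toy, toy_weights)
```

Here `feat_b` contributes nothing because it is constant, and the middle row gets 0.5 × 0.5 × 0.7 from `feat_a` alone.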

# Define Weights for Each Feature

In [8]:
weights = {
    'stu_per_tch_kindergarten': 0.05,
    'stu_per_sch_kindergarten': 0.05,
    'stu_per_class_kindergarten': 0.05,
    'budget_per_stu_kindergarten': 0.05,
    'schoolization_ratio_kindergarten': 0.2,
    
    'stu_per_tch_primary': 0.05,
    'stu_per_sch_primary': 0.05,
    'stu_per_class_primary': 0.05,
    'budget_per_stu_primary': 0.05,
    'schoolization_ratio_primary': 0.2,

    'stu_per_tch_secondary': 0.05,
    'stu_per_sch_secondary': 0.05,
    'stu_per_class_secondary': 0.05,
    'budget_per_stu_secondary': 0.05,
    'schoolization_ratio_secondary': 0.2,

    'stu_per_tch_high': 0.05,
    'stu_per_sch_high': 0.05,
    'stu_per_class_high': 0.05,
    'budget_per_stu_high': 0.05,
    'schoolization_ratio_high': 0.2,
}
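The weights above add up to 1.6 rather than 1, so the pre-HDI index can range from 0 to 1.6. Whether that is intended is not stated. If an index on a 0 to 1 scale is wanted, one option is to rescale the weights, as in the sketch below. The dictionary is rebuilt programmatically with the same values; `levels`, `per_level` and `normalized_weights` are names introduced for this example.

```python
# Rebuild the same 20 weights as in the cell above
levels = ['kindergarten', 'primary', 'secondary', 'high']
per_level = {
    'stu_per_tch': 0.05,
    'stu_per_sch': 0.05,
    'stu_per_class': 0.05,
    'budget_per_stu': 0.05,
    'schoolization_ratio': 0.2,
}
weights = {f'{feature}_{level}': w for level in levels for feature, w in per_level.items()}

# Sanity check: how much do the weights add up to?
total = sum(weights.values())

# Optional rescaling so that the weights sum to 1
normalized_weights = {name: w / total for name, w in weights.items()}
```

Rescaling divides every index value by the same constant, so the ranking of provinces does not change.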

# Starting the Data Processing

## Load and Process the Data for Each Year

In [11]:
all_data = []
for year in range(2019, 2022):
    data = pd.read_csv(f'C:\\Users\\Serhan\\Desktop\\CS210 Proje\\Sanity-Check\\data\\csv\\data_{year}.csv', encoding='ISO-8859-1')
    education_index = calculate_education_index(data, weights)
    data['education_index'] = education_index
    data['year'] = year
    all_data.append(data)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 1318: invalid start byte
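The error above mentions the `utf-8` codec, so it presumably comes from a run without the `encoding` argument. Byte `0xfd` is `ı` in Windows-1254 and ISO-8859-9 (the Turkish code pages) but `ý` in ISO-8859-1, so reading with `ISO-8859-1` avoids the exception yet may garble Turkish province names. A fallback reader is sketched below; `read_csv_with_fallback` and the sample file are hypothetical, and the assumption that the real files are Windows-1254 has not been verified.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=('utf-8', 'cp1254', 'ISO-8859-9')):
    """Try each encoding in turn and return the first DataFrame that decodes cleanly."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


# Demonstration on a made-up file written in Windows-1254
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, 'sample.csv')
    with open(sample_path, 'w', encoding='cp1254', newline='') as handle:
        handle.write('province,hdi_index\nKırklareli,0.8\n')
    sample = read_csv_with_fallback(sample_path)
```

The `utf-8` attempt fails on the `ı` byte and the reader falls through to `cp1254`, which decodes the name correctly.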

## Combine the Data for All Years


In [None]:
combined_data = pd.concat(all_data, ignore_index=True)


# Some Data Visualization to See Our Work

## Visualize the Correlation Matrix Heatmap

In [None]:
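# numeric_only=True skips text columns, which otherwise raise an error in recent pandas versions
corr_matrix = combined_data.corr(numeric_only=True)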
plt.figure(figsize=(12, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()

## Visualize the Actual Values for Every Column for the Years 2012 to 2020

In [None]:
for column in weights.keys():
    plt.plot(combined_data['year'], combined_data[column], marker='o', label=column)
plt.xlabel('Year')
plt.ylabel('Actual Values')
plt.title('Actual Values for Each Column (2012-2020)')
plt.legend()
plt.grid()
plt.show()

## Visualize the Calculated Education Index for the Years 2012 to 2020

In [None]:
plt.plot(combined_data['year'], combined_data['education_index'], marker='o')
plt.xlabel('Year')
plt.ylabel('Actual Education Index')
plt.title('Actual Education Index for 2012-2020')
plt.grid()
plt.show()
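If `combined_data` holds one row per province per year, plotting the raw rows against `year` draws many points for each year. Aggregating first gives one value per year. The sketch below uses made-up numbers and assumes a `province` column, which is not confirmed by the notebook.

```python
import pandas as pd

# Made-up data: two provinces observed in two years
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'province': ['A', 'B', 'A', 'B'],
    'education_index': [0.2, 0.4, 0.3, 0.5],
})

# National average per year: one value for each year
yearly_mean = toy.groupby('year')['education_index'].mean()
```

The result can be plotted with `plt.plot(yearly_mean.index, yearly_mean.values, marker='o')` to obtain a single line across the years.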

# Prepare the Future Data

In [None]:
future_years = np.arange(2021, 2031)
future_data = pd.DataFrame({'year': future_years})

# Predict Each Column for the Years 2021 to 2030

In [None]:
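# Forecast every weighted feature, plus hdi_index, which calculate_education_index also reads.
# Extrapolating hdi_index with the same linear model is an assumption.
for column in list(weights.keys()) + ['hdi_index']:
    X_train = combined_data[['year']]
    y_train = combined_data[column]
    future_data[column] = train_and_predict(X_train, y_train, future_data[['year']])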
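The loop above fits a single trend line per column across all rows, so it yields one national forecast per year rather than one per province. Since the stated goal is an index for every province, a per-province fit is sketched below. The function `predict_per_province`, the `province` column name, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_per_province(history, column, future_years, group_column='province'):
    """Fit one linear trend per province and return the forecasts stacked in one DataFrame."""
    frames = []
    for name, group in history.groupby(group_column):
        model = LinearRegression().fit(group[['year']], group[column])
        future = pd.DataFrame({'year': future_years})
        future[column] = model.predict(future[['year']])
        future[group_column] = name
        frames.append(future)
    return pd.concat(frames, ignore_index=True)


# Made-up history: province A rises by 1 per year, province B falls by 2 per year
toy_history = pd.DataFrame({
    'year': [2018, 2019, 2020, 2018, 2019, 2020],
    'province': ['A', 'A', 'A', 'B', 'B', 'B'],
    'stu_per_tch_primary': [1.0, 2.0, 3.0, 10.0, 8.0, 6.0],
})
toy_forecast = predict_per_province(toy_history, 'stu_per_tch_primary', np.arange(2021, 2023))
```

With only a few yearly observations per province, each fit rests on very little data, so the resulting forecasts should be read with caution.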

# Calculate the Education Index for the Predicted Values

In [None]:
# Calculate the education index for the predicted values
future_data['predicted_education_index'] = calculate_education_index(future_data, weights)

# Visualize the Predicted Education Index for the Years 2021 to 2030

In [None]:
plt.plot(future_data['year'], future_data['predicted_education_index'], marker='o')
plt.xlabel('Year')
plt.ylabel('Predicted Education Index')
plt.title('Predicted Education Index for 2021-2030')
plt.grid()
plt.show()

# Combine the Actual and Predicted Indexes Into One DataFrame

In [None]:
actual_indexes = combined_data[['year', 'education_index']].rename(columns={'education_index': 'actual_education_index'})
predicted_indexes = future_data[['year', 'predicted_education_index']]
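# Stack the rows so there is a single 'year' column; each index column is NaN where it does not apply
all_indexes = pd.concat([actual_indexes, predicted_indexes], ignore_index=True)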
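How the two frames are concatenated determines whether `all_indexes['year']` is a single column that can be passed to `plt.plot`. The sketch below contrasts column-wise and row-wise concatenation on made-up values; the variable names are chosen for the example.

```python
import pandas as pd

actual = pd.DataFrame({'year': [2019, 2020], 'actual_education_index': [0.40, 0.42]})
predicted = pd.DataFrame({'year': [2021, 2022], 'predicted_education_index': [0.44, 0.46]})

# Column-wise: rows are aligned by position and the result has two 'year' columns
side_by_side = pd.concat([actual, predicted], axis=1)

# Row-wise: one 'year' column, with NaN where a series has no value for that year
stacked = pd.concat([actual, predicted], ignore_index=True)
```

In the row-wise result, matplotlib skips the NaN entries, so the actual and predicted series each appear only over their own years.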

# Plot the Actual and Predicted Education Indexes

In [None]:
plt.plot(all_indexes['year'], all_indexes['actual_education_index'], marker='o', label='Actual Education Index (2012-2020)')
plt.plot(all_indexes['year'], all_indexes['predicted_education_index'], marker='o', linestyle='--', label='Predicted Education Index (2021-2030)')
plt.xlabel('Year')
plt.ylabel('Education Index')
plt.title('Actual and Predicted Education Index (2012-2030)')
plt.legend()
plt.grid()
plt.show()

# Our Project Ends Here, See You in Future Projects!