## Dataset Overview

**The following dataset was collected during a web development program. It contains data
about selected students who applied for the program. The metadata below describes each
attribute.**
### Metadata
Attribute Explanation

<br> Gender -  The specific gender <br>
<br>Location - Where they come from <br>
<br>County Number - County they belong to / International Students<br>
<br>Computer Proficient - Do they have basic computer skills?<br>
<br>Level Of Education - Which level of education they hold<br>
<br>Commitment - If they can commit 2 months to do the program<br>
<br>Access to Device -  Do they have a computer?<br>
<br>Access to Internet - Can they access the Internet to Learn?<br>
<br>Information Gain  - How did they know about the scholarship<br>
<br>Lecturer - Who trained the student?<br>
<br>Selected - Were they selected for the program?<br>
<br>Completed  - Did they complete the program?<br>

## Data Analysis

(Graphical representations can be bar graphs, pie charts, swarm plots, etc.) Use Python.
1. Graphical representation to show the Applications in terms of gender
2. Graphical Representation to show Information Gain
3. Graphical Representation to show Applications in terms of Location/County
4. Graphical Representation to show Applications in terms of Level of education
5. Graphical representation to show how many students completed
6. Graphical representation to show student distribution among the lecturers
7. Graphical representation to show the completion rate for lecturers A, B and E

## Machine Learning
8. Choose a machine learning model and train it to predict whether a student will complete
the program or not.
(Remember to comment your code.)


In [None]:
#Import Libraries 

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
import matplotlib as mpl
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from matplotlib import rcParams
from sklearn.metrics import explained_variance_score

sns.set_style("darkgrid")

## Load the dataset

In [None]:

df = pd.read_csv('Web Dev Cleaned Data (1).csv')

df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isna().sum()

## Data Cleaning

In [None]:
# Drop the rows with missing values; this won't materially affect the dataset

df = df.dropna()

In [None]:
df.isna().sum()

In [None]:
df.columns

In [None]:
# Rename the 'Completed ' column to remove its trailing space
df = df.rename(columns={'Completed ': 'Completed'})

In [None]:
# Normalize the values in the Completed column: 'yes' becomes 'Yes'

df['Completed'] = df['Completed'].replace('yes','Yes')

In [None]:
# Drop the redundant index column 'Unnamed: 0'
df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
# Save the new CSV; index=False stops pandas from writing the row index as an extra column
df.to_csv('NewWebDev1.csv', index=False)

In [None]:
#Import the new CSV For analysis 

df = pd.read_csv('NewWebDev1.csv')
df.head(5)
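
The cleaning steps above can be sketched as one reusable function. This is only an illustration on a hypothetical toy frame (the `raw` data and the `clean` helper are assumptions, not part of the original notebook), and it writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical toy frame reproducing the issues seen above:
# a missing value, a trailing space in a column name, and a lowercase 'yes'
raw = pd.DataFrame({
    "Gender": ["Female", "Male", None, "Female"],
    "Completed ": ["Yes", "yes", "No", "No"],
})


def clean(frame):
    """Return a cleaned copy: no missing rows, tidy column names and labels."""
    out = frame.dropna().copy()
    # Strip stray whitespace from every column name
    out.columns = out.columns.str.strip()
    # Make the labels consistent: 'yes' -> 'Yes'
    out["Completed"] = out["Completed"].str.strip().str.capitalize()
    return out


cleaned = clean(raw)

# Round-trip through CSV; index=False avoids an extra 'Unnamed: 0' column
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Stripping all column names at once also covers any other column that carries a trailing space, so later cells can refer to the names without it.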


## 1. Graphical representation to show the Applications in terms of gender

In [None]:
by_gender = df.groupby('Gender')
by_gender.size().plot(kind='bar')

## 2. Graphical Representation to show Information Gain

In [None]:
# Plot frequency
df['InformationGain'].value_counts().plot.bar()


## 3. Graphical Representation to show Applications in terms of Location/County

In [None]:
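# Install plotly first if it is not already available
%pip install plotly

In [None]:
import plotly.express as px

# Count the applications per location
df_location = df.groupby(['Location'])['Location'].count().reset_index(name='count')

# These rcParams only affect matplotlib figures, not the plotly chart below
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# Pie chart of applications per location
fig = px.pie(df_location, values='count', names='Location', title='Applications in terms of Location/County')
fig.show()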

## 4. Graphical Representation to show Applications in terms of Level of education

In [None]:
df.head()

In [None]:
df_education = df.groupby(['Levelofeducation'])['Levelofeducation'].count().reset_index(name='count')
df_education

In [None]:
df_education = df.groupby(['Levelofeducation'])['Levelofeducation'].count().reset_index(name='count')

import plotly.express as px

plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

fig = px.pie(df_education, values='count', names='Levelofeducation', title='Applications in terms of Level of Education')
fig.show()

## 5. Graphical representation to show how many students completed

In [None]:
df.head(2)

In [None]:
df.columns

In [None]:
Completed = df.groupby('Completed')
Completed.size().plot(kind='bar')

## 6. Graphical representation to show student distribution among the lecturers


In [None]:
df['Lecturer'].value_counts().plot(kind='bar')

## 7. Graphical representation to show the completion rate for lecturers A, B and E

In [None]:
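# Completion rate = share of students marked 'Yes' per lecturer
# Assumes the Lecturer column uses the labels 'A', 'B' and 'E'
subset = df[df['Lecturer'].isin(['A', 'B', 'E'])]
completion_rate = (subset['Completed'] == 'Yes').groupby(subset['Lecturer']).mean()

completion_rate.plot(kind='bar')
plt.ylabel('Completion rate')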
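
As a worked example of the arithmetic behind a completion rate (completed students divided by all students of that lecturer), here is a minimal sketch on hypothetical toy data. The lecturer labels and the counts are assumptions chosen so the result is easy to check by hand; the real column values may differ.

```python
import pandas as pd

# Hypothetical toy data; the real labels in the Lecturer column may differ
toy = pd.DataFrame({
    "Lecturer": ["A", "A", "A", "A", "B", "B", "E", "E", "E", "E", "C"],
    "Completed": ["Yes", "Yes", "Yes", "No", "Yes", "No",
                  "Yes", "No", "No", "No", "Yes"],
})

# Keep only the lecturers of interest
subset = toy[toy["Lecturer"].isin(["A", "B", "E"])]

# normalize="index" turns each row of counts into shares that sum to 1
rates = pd.crosstab(subset["Lecturer"], subset["Completed"], normalize="index")

# The 'Yes' column is the completion rate per lecturer
completion_rate = rates["Yes"]
print(completion_rate)
```

In this toy data lecturer A has 3 of 4 students completing, B has 1 of 2 and E has 1 of 4. Calling `rates.plot(kind="bar", stacked=True)` would draw the completed and not-completed shares for each lecturer as one stacked bar.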

## 8. Choose and train a machine learning model to predict whether a student will complete or not

## Data Preparation:

Split the data into X and y.

In [None]:
#Columns to use 
df.dtypes

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Categorical features; the target column 'Completed' is deliberately left out
categorical_columns = [ 'Location', 
       'computerProficient', 'Levelofeducation', 'Commitment ',
       'AccestoDevice', 'AccestoInternet', 'InformationGain',
       'Lecturer']

numerical_columns = ['CountyNumber']

# Scale the numeric column and one-hot encode the categorical ones
preprocessor = make_column_transformer(
        (StandardScaler(), numerical_columns),
        (OneHotEncoder(handle_unknown='ignore'), categorical_columns))
preprocessor

In [None]:
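# One-hot encode the categorical features; the target 'Completed' is not encoded
one_hot_encoded_data = pd.get_dummies(df, columns = ['Gender', 'Location', 
       'computerProficient', 'Levelofeducation', 'Commitment ',
       'AccestoDevice', 'AccestoInternet', 'InformationGain',
       'Lecturer'])
print(one_hot_encoded_data)

In [None]:
#Split data into X and Y 

from sklearn.model_selection import train_test_split 

# Features: every column except the target (and any leftover index column)
X = one_hot_encoded_data.drop(columns=['Completed', 'Unnamed: 0'], errors='ignore')
# Target: 1 if the student completed the program, 0 otherwise
Y = (df['Completed'] == 'Yes').astype(int)

print(X.columns)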

In [None]:
# Scale the features
# scale() returns a NumPy array, so rebuild the DataFrame with the original column names
from sklearn.preprocessing import scale

cols = X.columns
X = pd.DataFrame(scale(X), columns=cols, index=X.index)
X.columns

In [None]:
# split data into train and test datasets 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y, train_size=0.8, test_size=0.2, random_state=100)
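
If far fewer students complete than drop out (or the reverse), a plain random split can leave the test set with a different class balance than the training set. The sketch below shows the `stratify` argument of `train_test_split` on a hypothetical imbalanced target; the array sizes and the 20% completion share are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 completers (1) out of 100 students
y = np.array([1] * 20 + [0] * 80)
# Placeholder feature matrix with one column
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the share of each class the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=100
)

print(y_train.mean(), y_test.mean())
```

With stratification both splits keep the same 20% share of completers, so the test score is measured on the same class balance the model was trained on.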

## Modelling

In [None]:
# 1. Logistic regression from sklearn
# Completion is a yes/no outcome, so a classifier replaces linear regression here
# Build the first model with all the features

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_true=y_test, y_pred=y_pred))
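
One possible way to tie preprocessing and modelling together is a scikit-learn pipeline, sketched below on synthetic data. The toy frame, its value ranges and the rule that generates `Completed` are all assumptions made up for illustration; only the column names are borrowed from the dataset, and this is not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real CSV
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "CountyNumber": rng.integers(1, 48, size=n),
    "Lecturer": rng.choice(["A", "B", "E"], size=n),
    "AccestoInternet": rng.choice(["Yes", "No"], size=n),
})
# Synthetic target: students with internet access complete more often
p_complete = np.where(toy["AccestoInternet"] == "Yes", 0.8, 0.3)
toy["Completed"] = np.where(rng.random(n) < p_complete, "Yes", "No")

X = toy.drop(columns=["Completed"])
y = toy["Completed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=100
)

# Scale the numeric column, one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ["CountyNumber"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Lecturer", "AccestoInternet"]),
)

# The pipeline fits the preprocessing on the training data only
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Rows are true labels, columns are predicted labels, in the order No, Yes
cm = confusion_matrix(y_test, y_pred, labels=["No", "Yes"])
print(accuracy)
print(cm)
```

Because the scaler and encoder sit inside the pipeline, they are fitted on the training split alone, so no statistics from the test split influence the model. The confusion matrix shows which kind of mistake the classifier makes, which accuracy alone does not reveal.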