# Regression Analysis

## About the dataset

This dataset was collected to investigate the relationship between systolic blood pressure (Response) and personal characteristics, school types, location, mode of transport to school, BMI and overweight in secondary school children in Fako division, South West Region, Cameroon.

# Purpose and Objective 

The purpose of this project is to demonstrate exploratory data analysis (EDA), that is, discovering, structuring, cleaning, joining, validating and presenting the data, as a best practice for gaining insights and making the data fit for further analysis and modelling. Later on, we will build our machine learning models using the cleaned and validated data obtained from the EDA.

The main objectives of this project are to:

1. Build a multiple linear regression model to investigate possible predictors of systolic BP.
2. Build a simple logistic regression model, with overweight as the outcome, to identify risk factors for overweight.

# Tools used for analysis, visualizations and model building

In this project, we will use Python libraries for analysis (pandas and NumPy), visualization (Matplotlib) and model building.

# Project parts

This project will be broken down into 3 parts:

# Part 1: Exploratory Data Analysis (EDA)

In this part of the project, we will use the required libraries to get to know our data, find insights, check and handle missing data, as well as visualize our data to spot and identify the characteristics that lie within our data.

# Part 2: Model building 

In this part, we will use the insights generated during the EDA to build our machine learning models.

# Part 3: Presentation of results

What insights did we generate from our models, and what is the impact of these findings on those concerned?

# Exploratory Data Analysis

# Importing libraries

In [1]:
# import libraries and packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset into a Dataframe
The dataset is provided as an Excel file named PRACTICAL DATA, stored on my local machine. We will load the data from the Excel file into a DataFrame and save it in a variable.

In [2]:
# Load data from an Excel file into a DataFrame and save it in a variable
# I will call the variable students, since the data is about school children


students = pd.read_excel(r'C:\Users\THE EYE INFORMATIQUE\Desktop\PRACTICAL  DATA.xlsx')

# Data exploration
We will explore the dataset and answer a series of questions that guide our analysis and, later, our modelling choices.

To begin, let's display the first few rows of our data to get an understanding of how the dataset is structured.

In [3]:
# Display the first 10 rows of the data
# .head() displays the first 5 rows by default, so we pass 10 explicitly


students.head(10)

| | SN | Location | School types | Class | Age | Sex | Household size | Education level | Overweight in family | Mean to go to school | Weight(kg) | Height(cm) | Abdominal sie | Systolic BP | BMI | Overweight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | Semi urban | Day school | 3 | 13 | Female | 1 | Secondary | No | On foot | 43.0 | 157 | 75.0 | 118.0 | 17.4 | No |
| 1 | 7 | Semi urban | Day school | 3 | 14 | Female | 2 | University | No | On foot | 46.0 | 159 | 76.0 | 111.0 | 18.2 | No |
| 2 | 22 | Semi urban | Day school | 1 | 10 | Female | 2 | University | No | On foot | 33.0 | 132 | 63.0 | 94.0 | 19.2 | YES |
| 3 | 24 | Semi urban | Day school | 4 | 19 | Female | 2 | Primary | No | On foot | 65.0 | 168 | 80.0 | 126.0 | 23.1 | No |
| 4 | 32 | Semi urban | Day school | 3 | 12 | Female | 2 | University | No | On foot | 31.0 | 151 | 70.0 | 87.0 | 13.8 | No |
| 5 | 38 | Semi urban | Day school | 3 | 12 | Female | 2 | Secondary | No | On foot | 46.0 | 155 | 68.0 | 123.0 | 19.3 | No |
| 6 | 43 | Semi urban | Day school | 3 | 13 | Female | 3 | University | No | On foot | 50.0 | 156 | 73.0 | 94.0 | 20.4 | No |
| 7 | 50 | Semi urban | Day school | 1 | 11 | Female | 2 | University | No | On foot | 42.0 | 151 | 65.0 | 99.0 | 18.3 | No |
| 8 | 63 | Rural | Bording | 5 | 17 | Female | 3 | University | No | On foot | 45.0 | 162 | 68.0 | 128.0 | 17.1 | No |
| 9 | 64 | Rural | Bording | 2 | 12 | Female | 2 | Secondary | No | By bike | 41.0 | 146 | 70.0 | 108.0 | 19.3 | No |


Let's calculate the number of rows and columns in the dataset

In [4]:
# To get the number of rows and columns in our data, we use the shape property

students.shape

(1229, 16)

# Question: 
What do you notice about the shape of the dataset?

- The shape of the dataset is (1229, 16). The first number, 1229, is the number of rows (entries). The second number, 16, is the number of columns. According to this dataset, 1229 school students were interviewed and evaluated, and 16 attributes were recorded for each student.
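
As a small illustration of how the shape tuple can be unpacked, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for this example and are not part of the study data.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [13, 14, 10],
    "Sex": ["Female", "Female", "Male"],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Applied to `students`, the same unpacking would give the row and column counts discussed above.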

Let's find out the characteristics or aspects (columns) of our dataset, and eliminate the ones we might not need. 

In [5]:
# Display all the columns in the data set.

students.columns

Index(['SN', 'Location', 'School types', 'Class ', 'Age', 'Sex',
       'Household size', 'Education level', 'Overweight in family',
       'Mean to go to school ', 'Weight(kg)', 'Height(cm)', 'Abdominal sie',
       'Systolic BP', 'BMI', 'Overweight'],
      dtype='object')
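
Note that two of the names in this output, `'Class '` and `'Mean to go to school '`, end with a space. The sketch below shows one way such padded names could be detected and stripped. It uses a toy, empty DataFrame with a few of the same column names rather than the real `students` data, and it is only an illustration: if the names were stripped in this notebook, later cells would have to refer to `'Class'` instead of `'Class '`.

```python
import pandas as pd

# Toy frame with a few of the column names shown above (no rows needed).
toy = pd.DataFrame(columns=["SN", "Class ", "Mean to go to school ", "Age"])

# Names that change when surrounding whitespace is removed are padded.
padded = [name for name in toy.columns if name != name.strip()]
print(padded)

# Strip leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```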

# Let's do some clean-up
Some of these columns are not very relevant to us and hence we will drop them all (SN, Class, Household size, Overweight in family, Abdominal sie)

In [6]:
# Drop columns [SN, Class, Household size, Overweight in family, Abdominal sie]
# we set axis to 1 to indicate that we want to delete columns, not rows
# we will assign the result back to the students variable


students = students.drop(['SN', 'Class ', 'Household size', 'Overweight in family', 'Abdominal sie'], axis = 1)

students.shape

(1229, 11)

Now we are left with 11 columns which we will use for our analysis.
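
For reference, `drop` also accepts a `columns=` keyword, which does the same thing as passing `axis=1`. The following is a minimal sketch on a toy DataFrame (the frame and its values are assumptions for illustration); it also shows that `drop` returns a new frame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({"SN": [6, 7], "Age": [13, 14], "Class ": [3, 3]})

# Two equivalent ways to drop columns.
by_axis = toy.drop(["SN", "Class "], axis=1)
by_keyword = toy.drop(columns=["SN", "Class "])
print(by_axis.equals(by_keyword))

# errors="ignore" skips labels that are not present instead of raising KeyError.
safe = toy.drop(columns=["SN", "Not a column"], errors="ignore")
print(list(safe.columns))

# The original frame still has all three columns.
print(toy.shape)
```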

Now, let's preview our dataset 

In [7]:
students

| | Location | School types | Age | Sex | Education level | Mean to go to school | Weight(kg) | Height(cm) | Systolic BP | BMI | Overweight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Semi urban | Day school | 13 | Female | Secondary | On foot | 43.0 | 157 | 118.0 | 17.400000 | No |
| 1 | Semi urban | Day school | 14 | Female | University | On foot | 46.0 | 159 | 111.0 | 18.200000 | No |
| 2 | Semi urban | Day school | 10 | Female | University | On foot | 33.0 | 132 | 94.0 | 19.200000 | YES |
| 3 | Semi urban | Day school | 19 | Female | Primary | On foot | 65.0 | 168 | 126.0 | 23.100000 | No |
| 4 | Semi urban | Day school | 12 | Female | University | On foot | 31.0 | 151 | 87.0 | 13.800000 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1224 | Rural | Day school | 15 | Male | University | By bike | 45.5 | 156 | 112.0 | 18.696581 | No |
| 1225 | Rural | Day school | 14 | Female | | By car | 57.0 | 163 | 135.0 | 21.300000 | No |
| 1226 | Rural | Day school | 16 | Male | Secondary | By car | 54.0 | 158 | 106.0 | 21.631149 | No |
| 1227 | Rural | Bording school | 15 | Female | Secondary | On foot | 47.0 | 154 | 109.0 | 19.900000 | No |


# Now, let's get some basic information about our dataset
To further understand what the dataset contains, let's get basic information about it, including the data type of each column and the number of non-null values in each column.

In [8]:
# we will use the info() method from pandas to obtain this information

students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229 entries, 0 to 1228
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Location               1229 non-null   object 
 1   School types           1229 non-null   object 
 2   Age                    1229 non-null   int64  
 3   Sex                    1229 non-null   object 
 4   Education level        1228 non-null   object 
 5   Mean to go to school   1229 non-null   object 
 6   Weight(kg)             1229 non-null   float64
 7   Height(cm)             1229 non-null   int64  
 8   Systolic BP            1224 non-null   float64
 9   BMI                    1229 non-null   float64
 10  Overweight             1229 non-null   object 
dtypes: float64(3), int64(2), object(6)
memory usage: 105.7+ KB


# Questions
What can you comment about the type of data in the various columns?

- The dataset has both numeric (float64 and int64) and string (object) variables. The numeric variables are quantitative, while the string variables are categorical. Certain computations are difficult or impossible to perform on strings in Python, so later we will transform these categories into numbers, which will make computation easier.

- It is also worth noting that our data is fairly clean: nine of the eleven columns have no missing values. Only Education level (1228 non-null, so 1 missing) and Systolic BP (1224 non-null, so 5 missing) have gaps.
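
The non-null counts printed by `info()` can also be turned into explicit missing-value counts, and the numeric and string columns can be separated programmatically. The sketch below does this on a toy DataFrame whose values are made up for illustration. The last step, encoding a string column as integer codes, is only one possible way to represent categories as numbers and is an assumption, not necessarily the encoding used later in this project.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Male"],
    "Education level": ["Secondary", None, "University", "Primary"],
    "Systolic BP": [118.0, np.nan, 94.0, np.nan],
    "Age": [13, 14, 10, 19],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Split the columns by data type.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)

# One possible numeric encoding: integer codes, assigned in sorted category order.
sex_codes = toy["Sex"].astype("category").cat.codes
print(sex_codes.tolist())
```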

# More on getting to know our data

Let's explore some descriptive statistics for our dataset.

We will use the describe() method from the pandas library, which generates summary statistics for the numeric columns in our dataset.

In [10]:
# let's get descriptive statistics for our numeric data.

students.describe()

| | Age | Weight(kg) | Height(cm) | Systolic BP | BMI |
|---|---|---|---|---|---|
| count | 1229.0 | 1229.0 | 1229.0 | 1224.0 | 1229.0 |
| mean | 14.827502 | 54.074369 | 159.37917 | 113.162582 | 21.873039 |
| std | 1.861103 | 11.438974 | 20.850822 | 14.972437 | 10.066409 |
| min | 10.0 | 0.0 | 49.0 | 64.0 | 0.0 |
| 25% | 14.0 | 47.0 | 154.0 | 104.0 | 18.9 |
| 50% | 15.0 | 54.0 | 160.0 | 112.0 | 20.9 |
| 75% | 16.0 | 61.0 | 165.0 | 122.0 | 23.046875 |
| max | 19.0 | 91.0 | 757.0 | 184.0 | 180.8 |


## Questions

What can you deduce from the descriptives generated by the describe() function?

- The table gives us a summary of our numeric data: the count, mean, standard deviation, minimum and maximum of each variable, and of course the percentiles.
- From the table, we can see that the average age of our participants is approximately 15 years. The youngest student interviewed was 10 years old and the oldest was 19. The standard deviation is about 1.9 years, which tells us that the ages of our participants did not vary widely around the mean.
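
The same table also contains extremes that look implausible for secondary school children: a minimum weight of 0 kg, heights ranging from 49 cm to 757 cm, and BMI values from 0 to 180.8. The sketch below shows one way such rows could be flagged for review. The toy values are modelled on those extremes, and the plausibility bounds are illustrative assumptions, not clinical thresholds.

```python
import pandas as pd

# Toy rows modelled on the extremes in the summary table above.
toy = pd.DataFrame({
    "Weight(kg)": [43.0, 0.0, 54.0, 61.0],
    "Height(cm)": [157, 160, 757, 49],
    "BMI": [17.4, 0.0, 180.8, 20.9],
})

# Hypothetical plausibility bounds (assumptions for illustration only).
bounds = {"Weight(kg)": (20, 150), "Height(cm)": (100, 220), "BMI": (10, 60)}

# A row is suspect if any of its values falls outside its bounds.
suspect = pd.Series(False, index=toy.index)
for col, (low, high) in bounds.items():
    suspect |= ~toy[col].between(low, high)

print(toy[suspect])
```

Flagged rows would still need to be inspected by hand before deciding whether to correct or exclude them.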

Let's take a single variable from the table above and describe in detail what the table says about it. I will choose Systolic BP for our case study because it is our variable of interest in this study.

Let's run the statistics again, but this time only for our variable of interest, Systolic BP.

In [13]:
# summarise the variable systolic BP

students[['Systolic BP']].describe()

| | Systolic BP |
|---|---|
| count | 1224.0 |
| mean | 113.162582 |
| std | 14.972437 |
| min | 64.0 |
| 25% | 104.0 |
| 50% | 112.0 |
| 75% | 122.0 |
| max | 184.0 |


Our variable of interest has a count of 1224 instead of 1229, which tells us that there are 5 missing values for Systolic BP in our dataset. The mean Systolic BP of our participants is 113.16, with a standard deviation of 14.97. The standard deviation is about 13% of the mean, so the readings show a moderate spread around the mean, while the full range runs from 64 to 184.
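
One way to put the spread in context is the coefficient of variation, the standard deviation divided by the mean. The short calculation below uses only the figures reported in the table above; the variable names are chosen for this example.

```python
# Figures taken from the describe() output for Systolic BP.
count_total = 1229   # rows in the dataset
count_bp = 1224      # non-missing Systolic BP values
mean_bp = 113.162582
std_bp = 14.972437

# Number of missing Systolic BP values.
missing_bp = count_total - count_bp

# Coefficient of variation: spread relative to the mean.
cv = std_bp / mean_bp

print(missing_bp, round(cv, 3))
```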

###### The five number summary

The five number summary consists of i) the minimum, ii) the first quartile (Q1), iii) the median or second quartile (Q2), iv) the third quartile (Q3) and v) the maximum.

These five numbers give us a sense of the nature of our variable, its spread, and the position of our data points.

Let's explain and understand each number in the summary:

- The minimum: the lowest value in the dataset.

- The first quartile (Q1) or 25th percentile (25%): the middle number in the first half of the dataset. It means that 25% of the values in the entire dataset are below Q1 and 75% are above it.

- The median or second quartile (Q2) or 50th percentile (50%): the middle value of the dataset. It means that 50% of the values in the entire dataset are below Q2 and 50% are above it.

- The third quartile (Q3) or 75th percentile (75%): the middle number in the second half of the dataset. It means that 75% of the values in the entire dataset are below Q3 and 25% are above it.

- The maximum: the largest value in the dataset.

For Systolic BP, the five numbers are:

- Minimum: 64
- Q1: 104
- Q2: 112
- Q3: 122
- Maximum: 184
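
As a worked example, the five numbers for Systolic BP can be used to compute the interquartile range (IQR) and the commonly used 1.5 × IQR fences. The sketch below first shows how pandas computes quantiles on a toy series (values invented for illustration), then applies the fence arithmetic to the figures reported above. The 1.5 × IQR rule is a common convention brought in here as an assumption; it is not a method the analysis above commits to.

```python
import pandas as pd

# Toy series 1..9 to show how the five numbers are obtained with pandas.
toy = pd.Series(range(1, 10))
five_numbers = toy.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
print(five_numbers)

# Five number summary for Systolic BP, taken from the describe() output.
minimum, q1, median, q3, maximum = 64.0, 104.0, 112.0, 122.0, 184.0

# Interquartile range: the width of the middle 50% of the data.
iqr = q3 - q1

# Conventional 1.5 * IQR fences for spotting unusually low or high values.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
print(minimum < lower_fence, maximum > upper_fence)
```

With these figures, both the minimum and the maximum lie outside the fences, so the extreme readings may deserve a closer look.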


###### Let's handle the few missing values in our variable Systolic BP
