Data-Analysis-and-Statistical-Inference-Python

This repository contains statistical analysis for boston housing database & machine learning statistical analysis for loan dataset of 346 customers.

1. Statistical Analysis of Boston Housing Database

This notebook uses scipy, matplotlib, seaborn and statsmodels.api for effective statistical and visualization analysis.

Generating Descriptive Statistics and Visualizations such as

Median value of owner-occupied homes:

Quartile distribution of the owner occupied homes ranges from 17 to 25, q1 or 25% of range = 17 q2 or 75% of range = 25 and median = 21

Histogram for the Charles river variable:

Through the histogram we can say that tract bounds river are majorly found

Boxplot for the MEDV variable vs the AGE variable. (Discretizing the age variable into three groups of 35 years and younger, between 35 and 70 years and 70 years and older):

Scatter plot to show the relationship between Nitric oxide concentrations and the proportion of non-retail business acres per town:

A linear relationship is seen between Nitric oxide concentrations and the proportion of non-retail business.

Histogram for the pupil to teacher ratio variable:

Statistical Analysis

Is there a significant difference in median value of houses bounded by the Charles river or not?

Hypothesis:
H0:There is no significant difference in the mean values of median value of houses and Charles river variables (Null Hypothesis)
H1:There is a significant difference in the mean values of median value of houses and Charles river variables

Conclusion: Here we get to confirm that the P value is 0 which without doubt less than 0.05 thus we can reject our null hypothesis and opt the other hypothesis

Is there a difference in Median values of houses (MEDV) for each proportion of owner occupied units built prior to 1940 (AGE)?

Hypothesis
H0: All age groups have same mean values (Null Hypothesis)
H1: Atleast one age group mean differs

Resut: F_statistic:36.40764999196599, P-value:1.7105011022702984e-15

Conclusion: Here we get to know the P value is 1.7105011022702984e-15 which is less than 0.05 therefore we reject the null hypothesis and opt the other hypothesis

What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner occupied homes?

Hypothesis
H0: There is no impact of an additional weighted distance to the five Boston employment centres on the median value of owner occupied homes (Null Hypothesis)
H1: There is an impact of an additional weighted distance to the five Boston employment centres on the median value of owner occupied homes

Conclusion: Here we get to know the P value is 1.21e-08 which is less than 0.05 therefore we reject the null hypothesis and see that for every additional weighted distance, median value of owner occupied homes increases by 1.09 unit

2. Machine learning statistical classification for loan dataset

Dataset: Loan_Dataset

About dataset: This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields:

This notebook uses scikit-learn, matplotlib, itertools, pandas and numpy for effective statistical and visualization analysis.

Visualization of principal amount and loan_status while considering both genders differently:

Visualization of age and loan_status while considering both genders differently:

Visualization of dayofweek and loan_status while considering both genders differently:

Classification Methods:

K Nearest Neighbor(KNN)
Decision Tree
Support Vector Machine
Logistic Regression

KNN: The best accuracy was 0.7857142857142857 with k= 7
Decision Tree: DecisionTrees's Accuracy: 0.6142857142857143
Logistic Regression Modelling: LogisticRegression(C=0.01, solver='liblinear')

Model Evaluation using Test set

Loan Dataset for testing: Test_Dataset

Result ~ Obtaining Jaccard and F1 score for different methods:

KNN:
Jaccard Score: 0.6041666666666666
F1-Score: 0.658318980899626
Decision Tree:
Jaccard Score: 0.6590909090909091
F1-Score: 0.7366818873668188
SVM:
Jaccard Score: 0.78
F1-Score: 0.7583503077293734
Logistic Regression:
Jaccard Score: 0.7358490566037735
F1-Score: 0.6604267310789049
Log Loss: 0.5672153379912981

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Machine Learning Classification ~ Loan Acquiring Dataset.ipynb		Machine Learning Classification ~ Loan Acquiring Dataset.ipynb
README.md		README.md
Statistics Analysis for House Database in Boston.ipynb		Statistics Analysis for House Database in Boston.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data-Analysis-and-Statistical-Inference-Python

This repository contains statistical analysis for boston housing database & machine learning statistical analysis for loan dataset of 346 customers.

1. Statistical Analysis of Boston Housing Database

Generating Descriptive Statistics and Visualizations such as

Statistical Analysis

2. Machine learning statistical classification for loan dataset

Classification Methods:

Model Evaluation using Test set

Result ~ Obtaining Jaccard and F1 score for different methods:

Through this we can easily tell that Support Vector Machine or SVM is the best choice to proceed forward with statistic evaluation.

About

Uh oh!

Releases

Packages

Languages

Alexiso03/Data-Analysis-and-Statistical-Inference-Python

Folders and files

Latest commit

History

Repository files navigation

Data-Analysis-and-Statistical-Inference-Python

This repository contains statistical analysis for boston housing database & machine learning statistical analysis for loan dataset of 346 customers.

1. Statistical Analysis of Boston Housing Database

Generating Descriptive Statistics and Visualizations such as

Statistical Analysis

2. Machine learning statistical classification for loan dataset

Classification Methods:

Model Evaluation using Test set

Result ~ Obtaining Jaccard and F1 score for different methods:

Through this we can easily tell that Support Vector Machine or SVM is the best choice to proceed forward with statistic evaluation.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages