Mini project: a classification problem solved using different classification algorithms. Problem statement:
KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x,y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h:X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.
In this module we will explore the inner workings of KNN, choose the optimal value of K, and use the KNN implementation from scikit-learn.
The data set we’ll be using is the Iris Flower Dataset which was first introduced in 1936 by the famous statistician Ronald Fisher and consists of 50 observations from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.
Source: https://archive.ics.uci.edu/ml/datasets/Iris
Train the KNN algorithm to be able to distinguish the species from one another given the measurements of the 4 features.
*It's not always good to remove records having missing values; we may end up losing useful data points. Instead, we will see how to replace those missing values with an estimated value (the median).*
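Median imputation can be sketched as below. The column name and the tiny frame are hypothetical stand-ins for the Iris data loaded from the CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value; the real data comes from the Iris CSV.
df = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7, 5.0]})

# Replace missing values with the column median instead of dropping the rows.
df["sepal_length"] = df["sepal_length"].fillna(df["sepal_length"].median())
```

With a full dataframe, the same idea is `df.fillna(df.median(numeric_only=True))`.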
Change all the classes to numeric labels (0 to 2).
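One way to encode the three species as 0-2 is an explicit mapping (the species strings below follow the UCI file's naming; adjust if your file differs):

```python
import pandas as pd

# Hypothetical target column; the real file holds the three Iris species names.
y = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

# Map each class name to an integer code in 0..2.
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
y_num = y.map(mapping)
```

`sklearn.preprocessing.LabelEncoder` achieves the same result without writing the mapping by hand.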
Observe the association of each independent variable with the target variable, and drop from the feature set any variable whose correlation with the target lies in the range -0.1 to 0.1.
Observe the variance of the independent variables and drop those with zero or near-zero variance (variance < 0.1). They will have almost no influence on the classification.
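Both filters above can be sketched together. This uses scikit-learn's built-in copy of the Iris data as a stand-in for the UCI CSV; the thresholds are the ones stated in the instructions:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a numeric 'target' column

# Correlation filter: drop features with |corr(feature, target)| < 0.1.
corr = df.corr()["target"].drop("target")
low_corr = corr[corr.abs() < 0.1].index.tolist()

# Variance filter: drop features with near-zero variance (< 0.1).
variances = df.drop(columns=["target"]).var()
low_var = variances[variances < 0.1].index.tolist()

df = df.drop(columns=list(set(low_corr + low_var)))
```

On the Iris data no column actually falls below either threshold, so all four features survive; the pattern matters more than the outcome here.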
Plot the scatter matrix for all the variables.
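pandas ships a scatter-matrix helper that covers this step directly (again using the built-in Iris copy as a stand-in):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# One panel per feature pair, histograms on the diagonal.
axes = scatter_matrix(iris.data, figsize=(8, 8), diagonal="hist")
```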
Split the dataset into training and test sets with 80-20 ratio.
Build the model with scikit-learn, train it on the training set, and evaluate it on the test set. Print the accuracy of the model for k = 3, 5, 9.
Hint: For accuracy you can check accuracy_score() in scikit-learn
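A sketch of this step, using the same 80-20 split with `random_state=22` as in the snippet below and scikit-learn's built-in Iris copy in place of the CSV:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=22)

accuracies = {}
for k in (3, 5, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))
    print(f"k={k}: accuracy={accuracies[k]:.3f}")
```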
Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours from this list using the misclassification error.
Hint:
Misclassification error = 1 - test accuracy score. Calculate the misclassification error for each model with neighbours = 1, 3, 5, ..., 19 and find the model with the lowest error.
Plot misclassification error vs k (with the k value on the X-axis) using matplotlib.
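The k-sweep and the plot together might look like this (built-in Iris copy, same split parameters as below):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=22)

ks = list(range(1, 20, 2))  # 1, 3, 5, ..., 19
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))  # misclassification error

best_k = ks[errors.index(min(errors))]

plt.plot(ks, errors, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("Misclassification error")
plt.title("Misclassification error vs k")
```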
#Load all required libraries
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=22)
Dear Participant,
Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnosis PD, this would be an effective screening step prior to an appointment with a clinician.
Use the provided dataset in order to do your analysis.
#Attribute Information:
#Matrix column entries (attributes):
#name - ASCII subject name and recording number
#MDVP:Fo(Hz) - Average vocal fundamental frequency
#MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
#MDVP:Flo(Hz) - Minimum vocal fundamental frequency
#MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
#MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
#NHR,HNR - Two measures of ratio of noise to tonal components in the voice
#status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
#RPDE,D2 - Two nonlinear dynamical complexity measures
#DFA - Signal fractal scaling exponent
#spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Q3. Check for class imbalance. Do people with Parkinson's have greater representation in the dataset?
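A class-imbalance check reduces to counting the `status` column. The counts below are an illustrative stand-in for the real dataframe (the published UCI file has 147 Parkinson's and 48 healthy recordings, so yes, Parkinson's is over-represented):

```python
import pandas as pd

# Stand-in for the Parkinson's dataframe: 'status' is 1 = Parkinson's, 0 = healthy.
data = pd.DataFrame({"status": [1] * 147 + [0] * 48})

counts = data["status"].value_counts()
print(counts)

# A simple imbalance flag: majority class more than 1.5x the minority class.
imbalanced = counts.max() / counts.min() > 1.5
```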
Q5. Plot the distribution of all the features. State any observations you can make based on the distribution plots.
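`DataFrame.hist` plots one histogram per numeric column in a single call. The two columns below are synthetic stand-ins; with the real data you would simply call `data.hist(figsize=(16, 12))`:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import numpy as np
import pandas as pd

# Synthetic stand-in frame with two of the dataset's column names.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "MDVP:Fo(Hz)": rng.normal(154, 41, 195),
    "HNR": rng.normal(22, 4, 195),
})

# One histogram panel per column.
axes = data.hist(figsize=(8, 4), bins=20)
```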
Q10. Use the regularization parameters max_depth and min_samples_leaf to recreate the model. What is the impact on the model accuracy? How does regularization help?
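The comparison can be sketched as follows. Since the Parkinson's file isn't bundled with scikit-learn, this uses the built-in breast-cancer data as a stand-in binary-classification problem; the regularization values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until it fits the training set perfectly (overfitting).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularized tree: cap the depth and require a minimum number of samples per leaf.
reg = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("regularized train/test:", reg.score(X_train, y_train), reg.score(X_test, y_test))
```

Typically the regularized tree loses a little training accuracy but generalizes as well as or better than the full tree.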
Q11. Implement a Random Forest model. What is the optimal number of trees that gives the best result?
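Finding the best tree count is a sweep over `n_estimators`. Again the breast-cancer data stands in for the Parkinson's file, and the candidate counts are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one forest per candidate tree count and record test accuracy.
scores = {}
for n in (10, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)

best_n = max(scores, key=scores.get)
print(scores, "-> best n_estimators:", best_n)
```

For a more robust answer than a single split, the same sweep can be wrapped in `sklearn.model_selection.cross_val_score` or `GridSearchCV`.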
German Credit
Estimate default probabilities using logistic regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_excel('GermanCredit.XLSX')
data.shape
sns.histplot(data.CreditAmount)
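Estimating default probabilities with logistic regression comes down to `predict_proba`. Since the GermanCredit.XLSX file isn't reproduced here, the frame below is a synthetic stand-in, and the `Default` target column plus both feature names are assumptions about the real file's layout:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the German Credit frame; 'Default' is an assumed
# 0/1 target column, 'CreditAmount' and 'Duration' assumed feature columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CreditAmount": rng.normal(3000, 1500, 500).clip(250),
    "Duration": rng.integers(6, 60, 500),
})
# Make default odds rise with loan duration so the model has signal to find.
demo["Default"] = (rng.random(500) <
                   1 / (1 + np.exp(-(demo["Duration"] - 30) / 10))).astype(int)

X = demo[["CreditAmount", "Duration"]]
y = demo["Default"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the
# estimated default probability.
default_prob = model.predict_proba(X)[:, 1]
```

With the real data, the same pattern applies after selecting the actual feature columns and target from the loaded dataframe.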