
Project_Exam_1

PallaviArikatla edited this page Jul 7, 2020 · 21 revisions

INTRODUCTION:

The primary goal of this project is to implement the algorithms covered in Python Deep Learning. It mainly uses KNN, Naïve Bayes, SVM, and the elbow method.

SOFTWARE REQUIRED:

• Jupyter notebook (with installed plotting libraries)

• Python 3.

OBJECTIVE:

To implement the different algorithms and models learnt in Python Deep Learning so far, and to identify the better model by calculating accuracy scores and model efficiencies.

METHODS:

Different algorithms and methods used:

  • KNN – K-nearest neighbors is a simple algorithm that classifies cases based on their similarity to existing examples.
  • Naïve Bayes – Classification into two or more classes with the help of Bayes' theorem.
  • SVM – Support Vector Machines are typically used for classification, regression, and outlier detection.
  • Elbow method – Runs K-means clustering over a range of K values and is used to determine the number of clusters in a given dataset.
  • Text feature extraction – CountVectorizer and TF-IDF (from scikit-learn)

WORKFLOW:

Question 1:

a) Apply any classification of your choice (KNN, Naïve Bayes, SVM, Random Forest, ...) and report the performance.

i) Loaded the data using pandas library and created data frame.

ii) As the target column is "class", sliced the dataset accordingly: the target column into y_train and all other columns into x_train.

iii) Split the data into train and test using train_test_split with test_size=0.4 and random_state=0.

iv) Created GaussianNB() object to implement Naive Bayes algorithm.

v) Found predictions using the X_test data.

vi) Evaluated the model by finding the accuracy score for the test data.

vii) Got the classification report for the test and predicted data.

viii) Applied KNN classification algorithm.

ix) Now calculated the score for the test data.
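A minimal sketch of steps i–ix, assuming a standard scikit-learn workflow; the Iris dataset stands in for the exam dataset (which is not attached on this page), so the scores will differ from those reported in the evaluation section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load a stand-in dataset and split 60/40 as in step iii.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Steps iv-vii: Gaussian Naive Bayes, predictions, accuracy, report.
nb = GaussianNB().fit(X_train, y_train)
nb_pred = nb.predict(X_test)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb_pred))
print(classification_report(y_test, nb_pred))

# Steps viii-ix: KNN classifier and its score on the test data.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN score:", knn.score(X_test, y_test))
```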

CODE:

OUTPUT:

b) Visualize the number of samples per class.

i) Used the matplotlib library to visualize the required data.

ii) Found the fraud and non-fraud transactions in the dataset.

iii) Drew an area plot to visualize the fraud and non-fraud data.
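The per-class visualization can be sketched roughly as follows; a small synthetic 0/1 label column stands in for the credit-card data, and a bar plot is used where the write-up drew an area plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy imbalanced labels standing in for the fraud / non-fraud column.
df = pd.DataFrame({"Class": [0] * 95 + [1] * 5})
counts = df["Class"].value_counts()
print(counts.to_dict())  # {0: 95, 1: 5}

counts.plot(kind="bar")
plt.xlabel("Class (0 = non-fraud, 1 = fraud)")
plt.ylabel("Number of samples")
plt.savefig("class_counts.png")
```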

CODE:

OUTPUT:

c) Discuss challenges faced while dealing with an imbalanced dataset.

Problems with an imbalanced dataset:

i) The main problem while dealing with an imbalanced dataset is that ML algorithms can produce misleading results, because they are biased towards the majority class.

ii) ML algorithms tend to ignore the minority class, since it makes up only a small portion of the dataset.

iii) For example, with 1% minority-class and 99% majority-class data, an ML algorithm may simply predict the majority class for everything and still appear 99% accurate.

Handling an imbalanced dataset:

There are mainly two methods to handle this problem:

  1. Oversampling: This method removes the imbalance in the dataset by creating new minority-class instances.

There are 4 types of oversampling methods:

i) Random oversampling:

This method adds minority-class instances randomly by replicating the existing minority-class instances. It can result in overfitting.

ii) Cluster oversampling:

In this method, the K-means algorithm is applied to find the clusters. After that, each cluster is adjusted to have an equal number of instances.

This method can also lead to overfitting.

iii) Synthetic oversampling: In this method, a subset of the minority class is selected and new synthetic instances are created from it to balance the data. This reduces overfitting.

iv) Modified synthetic oversampling: This is the same as synthetic oversampling, but it preserves the distribution of the minority class.

  2. Undersampling:

In this method, the imbalance is reduced by shrinking the majority class. A common technique for this is random undersampling.

In random undersampling, existing majority-class instances are eliminated randomly. This method is risky because it can also eliminate useful data.
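Random over- and undersampling can both be sketched with `sklearn.utils.resample`; the toy frame below is an assumption, and synthetic oversampling (SMOTE-style) would normally come from the separate imbalanced-learn package rather than plain scikit-learn.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 90 majority rows, 10 minority rows.
df = pd.DataFrame({"Class": [0] * 90 + [1] * 10, "x": range(100)})
majority = df[df["Class"] == 0]
minority = df[df["Class"] == 1]

# Random oversampling: replicate minority rows (risk of overfitting, as noted).
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# Random undersampling: randomly drop majority rows (may discard useful data).
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up["Class"].value_counts().to_dict())    # {0: 90, 1: 90}
print(balanced_down["Class"].value_counts().to_dict())  # {0: 10, 1: 10}
```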

QUESTION 2:

a) Apply K-means on the data set, report K using elbow method.

i) Import the necessary libraries from scikit-learn.

ii) Create the data frame from the dataset.

iii) To find null values in the columns, use the isnull() method.

iv) Using the input data, plot the elbow-method graph and find the number of clusters.

v) Here we got 5 clusters.

vi) Using the below plot, we can infer that the optimized K value is 5.
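A hedged sketch of the elbow method, using `make_blobs` data with 5 true clusters as a stand-in to mirror the K = 5 result reported above:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 true clusters (an assumption standing in for the dataset).
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# Fit K-means for K = 1..10 and record the within-cluster sum of squares.
ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" of this curve suggests the number of clusters.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.savefig("elbow.png")
```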

CODE:

OUTPUT:

b) Evaluate with silhouette score or other scores relevant for unsupervised approaches.

i) Create the KMeans model.

ii) Fit the x data into the model.

iii) Found the prediction values by passing the x values.

iv) Calculated the silhouette score using the data and the predicted cluster labels.
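The silhouette evaluation might look roughly like this on the same kind of synthetic data (the 0.55 reported in the evaluation section comes from the real dataset, not this sketch):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# Steps i-iii: build the model, fit the data, get cluster predictions.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = km.predict(X)

# Step iv: silhouette score uses the data and the predicted labels only
# (no ground truth needed, which suits unsupervised evaluation).
score = silhouette_score(X, labels)
print("silhouette score:", round(score, 2))
```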

CODE:

OUTPUT:

c) Visualize cluster result.

  1. Used the matplotlib library and drew a scatter plot of the clusters.
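A minimal cluster-visualization sketch on synthetic stand-in data:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Color each point by its assigned cluster.
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("K-means cluster assignments")
plt.savefig("clusters.png")
```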

CODE:

OUTPUT:

QUESTION 3:

a) Apply some Exploratory Data Analysis on the given data set to draw some insight from the data.

  • Eliminate unnecessary columns:

  • Eliminate duplicate columns:

  • Remove null values:

b) Visualize the data and draw the model line.

i) Imported the necessary libraries.

ii) Read the weather.csv dataset and created a dataframe.

iii) Removed the unnecessary columns from the dataframe using drop.

iv) Then dropped the duplicate columns.

v) Checked for null values using the isnull() method and removed them.

vi) Using the seaborn library, visualized the temperature variance and humidity variance.

vii) Then drew a scatter plot of Temperature vs. Wind Speed and found the outliers.

viii) Removed the outliers found from the above plot and drew a new scatter plot.

ix) Drew a plot of Temperature vs. Visibility and found outliers.

x) Removed the outliers found from the above plot and drew a new plot.
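Steps iii–viii can be sketched as below; the column names and the 3-standard-deviation outlier rule are assumptions, since weather.csv is not reproduced here (the write-up identified outliers visually from the plots).

```python
import numpy as np
import pandas as pd

# Toy frame imitating weather.csv; the column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Temperature (C)": rng.normal(15, 8, 200),
    "Wind Speed (km/h)": rng.normal(10, 4, 200),
    "Summary": ["Clear"] * 200,  # stand-in for an unneeded text column
})
df.loc[5, "Wind Speed (km/h)"] = 120.0  # inject one obvious outlier

# Steps iii-iv: drop unnecessary columns and duplicates.
df = df.drop(columns=["Summary"]).drop_duplicates()

# Steps vii-viii: keep points within 3 standard deviations of the mean.
col = "Wind Speed (km/h)"
mask = (df[col] - df[col].mean()).abs() <= 3 * df[col].std()
cleaned = df[mask]
print(len(df), "->", len(cleaned))
```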

CODE:

OUTPUT:

c) Evaluate the model and try to interpret the performance that you get.

i) Import pandas, NumPy, and also train_test_split and the linear model from scikit-learn.

ii) Get the numeric features from the data frame.

iii) Find the top five positively correlated features with respect to the temperature column.

iv) Check for null values using the isnull() method.

v) Here we got all the null counts as zero.

vi) Handled the missing values using the interpolate() method.

vii) Split the data into train and test sets using train_test_split.

viii) Created the linear regression model.

ix) Trained the model with the training data.

x) Calculated the R2 score and the RMSE score.

CODE:

OUTPUT:

As the R2 score is 0.99, we can say that the model's predictions lie very close to the fitted line.
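A small sketch of the regression evaluation (steps vi–x) on a synthetic, strongly linear dataset; the near-1 R2 here is a property of the toy data, not a reproduction of the reported result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Toy near-linear data: a temperature-like feature and a correlated target.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 35, 300)
y = 0.9 * x + rng.normal(0, 0.5, 300)
df = pd.DataFrame({"temp": x, "target": y})
df.loc[3, "target"] = np.nan
df = df.interpolate()  # step vi: fill the missing value

# Step vii: train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    df[["temp"]], df["target"], random_state=0)

# Steps viii-x: fit, predict, and score with R2 and RMSE.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print("R2:", round(r2, 3), "RMSE:", round(rmse, 3))
```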

QUESTION 4:

Use the given dataset and apply different classifications.

i) First import the necessary libraries.

ii) Read the spam.csv file; since it cannot be read directly, it has to be loaded with the proper encoding.

iii) Cleaned the text data by dropping the unnecessary columns.

iv) Initialized the CountVectorizer.

v) Found the shape of the text data of the given dataset.

vi) Initialized the TfidfTransformer.

vii) Using TF-IDF, found each word's idf weight score for the text column (a more frequently used word gets a lower score).

viii) Then looped through the text column for 3 rows and found the tf-idf scores.

CODE:

OUTPUT:

ix) Applied CountVectorizer, TfidfTransformer, and MultinomialNB.

x) Trained the model and found prediction.

xi) Got classification report using prediction and test data.
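Steps ix–xi can be sketched with a scikit-learn Pipeline; the training texts and labels below are toy stand-ins, not the actual spam data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Toy spam/ham messages standing in for the dataset.
X_train = ["win a free prize now", "free entry claim prize",
           "lunch at noon?", "meeting moved to friday"]
y_train = ["spam", "spam", "ham", "ham"]

# Step ix: chain CountVectorizer, TfidfTransformer, and MultinomialNB.
clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
clf.fit(X_train, y_train)

# Steps x-xi: predict on held-out texts and report.
X_test = ["free prize waiting", "see you at lunch"]
pred = clf.predict(X_test)
print(pred)
print(classification_report(["spam", "ham"], pred, zero_division=0))
```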

QUESTION 5:

a) Pick any dataset online for the classification problem which includes both numeric and non-numeric features, and perform exploratory data analysis.

i) Import necessary libraries.

ii) We picked the train.csv dataset as it has both numeric and non-numeric features.

iii) Then, using data_frame.info(), printed all the columns.

iv) Found the unique values in LotShape and also checked for nulls using the isnull() method.

v) We observed that there are no nulls.

vi) Plotted the frequency of the features in LotShape.

vii) Drew a swarmplot between LotShape and frequency.

viii) Drew a scatterplot between MSZoning and SalePrice.

ix) Drew a relplot between RoofStyle and SalePrice.
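The EDA steps can be sketched as follows; the tiny frame below only imitates a few train.csv columns (LotShape, MSZoning, SalePrice), and plain matplotlib is used here instead of seaborn's swarmplot/relplot.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for train.csv with numeric and non-numeric features.
df = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg", "IR2", "Reg", "IR1"],
    "MSZoning": ["RL", "RM", "RL", "RL", "RM", "RL"],
    "SalePrice": [200000, 140000, 250000, 180000, 120000, 220000],
})
df.info()  # step iii: column overview

# Steps iv-v: unique values and null check.
print(df["LotShape"].unique())
print("nulls:", int(df.isnull().sum().sum()))

# Step vi: frequency of LotShape values.
freq = df["LotShape"].value_counts()
print(freq.to_dict())

# Step viii: categorical scatter of MSZoning vs SalePrice.
plt.scatter(df["MSZoning"], df["SalePrice"])
plt.xlabel("MSZoning")
plt.ylabel("SalePrice")
plt.savefig("mszoning_saleprice.png")
```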

b)Apply the three classification algorithms Naïve Bayes, SVM and KNN on the chosen data set and report which classifier gives better results:

i) Import all the necessary libraries from scikit library.

ii) Now check for any Null values and remove them.

iii) Sliced the dataframe: all the columns except the target column (SalePrice) into X, and SalePrice into y.

iv) Split the dataset into train and test using train_test_split.

v) Now calculating score using Naive bayes classification algorithm.

vi) Created GaussianNB() object to implement Naive Bayes algorithm.

vii) Found predictions using X_Test data.

viii) Evaluated the model by finding the accuracy score for the test data.

ix) Got the classification report for the test and predicted data.

x) Now calculated the score using KNN algorithm.

xi) Also calculated the score using the SVM model, for both the linear kernel and the rbf kernel.

Linear:

Non-Linear:

xii) We found there is only a very slight difference after evaluating SVM with the linear and rbf kernels.
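A hedged sketch of the four-way comparison; the Iris dataset stands in for the house-price data, so the relative rankings here need not match the reported results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each classifier and score it on the same held-out test split.
models = {
    "GaussianNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (rbf)": SVC(kernel="rbf"),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```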

DATASETS:

Link for datasets: https://drive.google.com/drive/u/0/folders/1IRP9UlclTDTCKO1XD6gKb78ZNJBFCLAR

PARAMETERS:

Target elements in individual questions are our considered parameters.

  • Question 1: class
  • Question 2: clustering (unsupervised, so no single target column)
  • Question 3: temperature
  • Question 4: We considered class and text features
  • Question 5: saleprice

EVALUATION AND DISCUSSION:

Question 1:

i) We have used both the Naïve Bayes and KNN classification algorithms.

ii) The accuracy score using the Naïve Bayes algorithm is 99.32 and the score with KNN is 99.83.

iii) We have also inferred that there are 2 classes, 0 and 1, and displayed the samples in both classes.

Question 2:

i) Using the elbow method, we inferred that the optimized K value is 5.

ii) The silhouette score is 0.55; from this we can say that the model is good.

iii) Through the visualization we can also see that there are 5 clusters.

Question 3:

i) We plotted different columns like temperature and humidity and found that there are variations.

ii) We found the top 5 correlated features.

iii) Since the R2 score is 0.99, which is almost 1, we inferred that the model fits the data very closely.

Question 4:

i) Using CountVectorizer, we analysed the shape of the word counts.

ii) Found the accuracy score for the three classifications using a pipeline.

Question 5:

i) We applied the three classification algorithms and inferred that the Naïve Bayes classifier gives the better result.

ii) We also found that SVM with the linear kernel gives better performance.

CONCLUSION:

This project helped us learn and implement different classification algorithms. We analysed all the algorithms by calculating their accuracy scores and successfully implemented them all.

Video reference:

https://umkc.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=7a6e6506-6d88-478b-9d48-abf0017043f9