# **Natural Language Processing**
## **What is NLP?**
**NLP stands for Natural Language Processing**, a field that combines computer science, human language, and artificial intelligence. It is the technology machines use to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

![alt text](https://lh4.googleusercontent.com/HvOsRDR2W55YFM1ffjeBAmCr1hUVeQlT_h2NZXd_dQCdqiaKTDeWIUg28wmznHhI9WhO2tc6q6DVu5SMmfKtfhiCiEocD82JFWSnpich9okxrfBtbwJynbbe8FgUF2vdT2S0Fb0)

## **Applications of NLP**
NLP has the following applications:
1. **Question Answering**

Question Answering focuses on building systems that automatically answer the questions asked by humans in a natural language.

![alt text](https://lh4.googleusercontent.com/llxua0Kxt4LY6Jq0KtwdoLqVp9VmGYtpU7SXy71KHB4tW7CRGT8tQ2xGmaoTZ3gX0CC5TQgRyuKEQttbUVB30yNQuI7R1HcEJKSKNms8BhqBEVnbst7w2KI6UIGzH81Fev77SWQ)

2. **Spam Detection**

Spam detection is used to detect unwanted emails before they reach a user's inbox.

![alt text](https://lh3.googleusercontent.com/OMoXE4C6lwKdM0QE55FjxKu8c6pqWt7-5P1AcGfqQPJuSKGjG0xHqKJNtvH9Fo16DbeQLki8uXGuBiblLYXHQE54Xru-hFysYLc1By3vnrRfD33qA51xx1zNnxANhZlaE5aeh-M)

3. **Sentiment Analysis**

Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. It is implemented through a combination of NLP (Natural Language Processing) and statistics by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).

![alt text](https://lh6.googleusercontent.com/ochS5RaVmHfIQyEML5QmtH1G9dXq9Oao3a9K7QuaQ_4Zn7EN5LzVg2Rv1cOJgpoNSdpLWqIZf_2VBeop3cmie3WeG0La9aCc4NkZ5fS_01aAwsJ6ACTi31G_xXe-73rdNISPqiM)

4. **Machine Translation**

Machine translation is used to translate text or speech from one natural language to another natural language.

![alt text](https://lh3.googleusercontent.com/AgvZMNyu7VEKSi2O-1gWlx82EqLCZdT5EmVxF4rEPokG2JJy8fSyPgMEVK5juLDTnkz5WVBQ_Nm6TEmXVrnOg5T2C7dYr2JDbanzPaKOWXINCy0uolVIx-Irez-_0rVl3ao3Eds)

5. **Spelling correction**

Software such as Microsoft Word and PowerPoint uses NLP to provide spelling correction.

![alt text](https://lh5.googleusercontent.com/nvn5xkiQ4lncZKob0z5fyMmBIyPyZZu71ltKvXh8qrc9lauc_et6Lt89drEjngOphnUv2xCLUQRkgry0qYF_Sdw4sw7u0g0sjMCRK9X8akuhGHc3VLYS39J4MwxHspHCiGuHkf4)

6. **Speech Recognition**

Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video recovery, dictating to Microsoft Word, voice biometrics, voice user interfaces, and so on.

7. **Chatbot**

Implementing chatbots is one of the important applications of NLP. Many companies use them to provide chat services to their customers.

![alt text](https://lh3.googleusercontent.com/o5A5kbmvoKElNZCsLSlv0aFyWw9Un4tINXuh98RQ-pCkjQvZRujkj7y_XaAZ73BL9wfMZH36ElKb7XOieagpHEJPQSWBLSur2hLAzw7tvypqw7vTqe8kGLv0LabwncV3F3R8glE)

8. **Information extraction**

Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.

9. **Natural Language Understanding (NLU)**

It converts a large set of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than natural language itself.







There are various techniques for doing NLP; here we will study the Bag of Words model.

Any algorithm we apply in NLP works on numbers, so we cannot feed our text into it directly. Hence, the Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the occurrences of the most frequently used words.
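As a rough illustration of the counting idea, the sketch below tallies word occurrences in a single sentence with `collections.Counter`. The sentence and the variable names are made up for illustration and are not part of the dataset used later.

```python
from collections import Counter

# Hypothetical example sentence (not taken from the dataset)
sentence = "the food was good and the service was good"

# A "bag" of words: each word mapped to how often it occurs; word order is ignored
bag = Counter(sentence.split())

# The two most frequent words together with their counts
top_two = bag.most_common(2)

print(bag["good"], bag["food"])
print(top_two)
```

The bag records how often each word occurs and discards where in the sentence it occurred.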

## **Problems we face in NLP:**
**Categorical data**

In NLP we deal exclusively with text data, so it must be handled carefully: we cannot pass categorical data directly to an ML model. An ML model relies on algorithms built from mathematical and statistical formulas, and text cannot be used as input to a mathematical formula.

**No fixed length**

In NLP we cannot know the length of the data in advance. We have to work with free-flowing text.
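One way to see how free-flowing text of any length can become fixed-length numeric input is sketched below: every review is mapped to a vector with one slot per vocabulary word. The helper name `to_vector` and the sample reviews are hypothetical.

```python
# Hypothetical reviews of different lengths
reviews = ["crust not good", "wow loved this place so much"]

# Shared vocabulary: every distinct word across all reviews, in sorted order
vocabulary = sorted({word for review in reviews for word in review.split()})

def to_vector(review, vocabulary):
    """Count how often each vocabulary word occurs in the review."""
    words = review.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(review, vocabulary) for review in reviews]

# Both vectors have the same length, although the reviews do not
print(len(vocabulary), [len(v) for v in vectors])
```

The vector length depends only on the size of the vocabulary, so a three-word review and a six-word review end up with the same number of features.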

## **Flow of Analysing Text data:**

![alt text](https://docs.google.com/drawings/u/0/d/s62D0wOlx8O7edOmTbyHsUw/image?w=341&h=356&rev=1&ac=1&parent=1OSwv1l6BFj2A0EBicPrt6iK4Y3p0Y-Lc)

In [293]:
import numpy as np
import matplotlib.pyplot as pyplot
import pandas as pd

In [294]:
data=pd.read_csv("Restaurant_Reviews.tsv",sep="\t")
data.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [295]:
x=data.iloc[:,0].values
x[:5],x.shape

(array(['Wow... Loved this place.', 'Crust is not good.',
        'Not tasty and the texture was just nasty.',
        'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
        'The selection on the menu was great and so were the prices.'],
       dtype=object),
 (1000,))

In [296]:
y=data.iloc[:,1:2]
y.shape

(1000, 1)

In [297]:
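# Keep only the necessary characters (letters)
import re
# re.sub() works on one string at a time, not on a whole array
rev=re.sub('[^a-zA-Z]', ' ', 'Wow... Loved this place.')
print(rev)
# Print the words one by one, skipping the empty strings left by repeated spaces
for words in rev.split(" "):
    if words!="":
        print(words,end=" ")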

Wow    Loved this place 
Wow Loved this place 

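In the Bag of Words technique we rely on keywords to identify whether a comment is positive or negative.

Each review is treated as a "bag" of the words it contains, ignoring word order, so the keywords that signal a positive or negative opinion end up in that bag.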

### **Step 2:  Get rid of unnecessary words/ symbols.**
For this we will use the **re (regular expressions)** library.

The **re library** has a function **sub()** (substitute), to which we pass the following parameters:

1. What to substitute (the pattern)
2. What to replace it with
3. Where (the string to search)
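The three parameters can be seen in a small sketch. The sample review below is made up for illustration; the pattern is the same one used in this notebook.

```python
import re

# Sample review with an apostrophe, a digit and punctuation (made up for illustration)
review = "Hadn't eaten here in 2 years... Great!"

# pattern: any character that is NOT a letter
# replacement: a single space
# string: the review to clean
cleaned = re.sub('[^a-zA-Z]', ' ', review)

print(repr(cleaned))
print(cleaned.split())
```

Because the apostrophe is also replaced, a contraction such as "Hadn't" is split into "Hadn" and "t". Each removed character becomes one space, so the cleaned string keeps the original length.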



In [298]:
filtered_text=[]
for rev in x:
    rev=re.sub('[^a-zA-Z]', ' ', rev)
    filtered_text.append(rev)
x=filtered_text
filtered_text[:5]       

['Wow    Loved this place ',
 'Crust is not good ',
 'Not tasty and the texture was just nasty ',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it ',
 'The selection on the menu was great and so were the prices ']

### **Step 3: Convert text into lowercase and a list of words.**
To convert the text to lowercase we will use the string method **lower()**, and to convert it into a list of words we will use **split()**.
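A short sketch of the two calls on one cleaned review is given below. It also contrasts `split()` with `split(" ")`, since the cleaned text contains runs of spaces where punctuation used to be; the variable names are chosen for illustration only.

```python
# Text as it looks after the regex step: extra spaces where punctuation used to be
cleaned = "Wow    Loved this place "

# split() with no argument splits on runs of whitespace and drops empty strings
tokens = cleaned.lower().split()

# split(" ") splits on every single space, so empty strings appear in the result
tokens_with_gaps = cleaned.lower().split(" ")

print(tokens)
print(tokens_with_gaps)
```

Calling `split()` without an argument avoids having to filter out empty strings afterwards.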


In [299]:
altered_text=[]
for rev in x:
    rev=rev.lower()
    rev=rev.split()
    altered_text.append(rev)
len(altered_text)
altered_text[999]

['then',
 'as',
 'if',
 'i',
 'hadn',
 't',
 'wasted',
 'enough',
 'of',
 'my',
 'life',
 'there',
 'they',
 'poured',
 'salt',
 'in',
 'the',
 'wound',
 'by',
 'drawing',
 'out',
 'the',
 'time',
 'it',
 'took',
 'to',
 'bring',
 'the',
 'check']

### **Step 4: Using stopwords.**
Stopwords are very common words that carry little meaning on their own.
**E.g. this, is, and, the, a, an, etc.**

We will download these stopwords from the **nltk (Natural Language Toolkit)** library.


In [300]:
%pip install nltk




In [301]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\S\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [302]:
from nltk.corpus import stopwords
print(stopwords)

<WordListCorpusReader in 'C:\\Users\\S\\AppData\\Roaming\\nltk_data\\corpora\\stopwords'>


In [303]:
for words in stopwords.words("english"):
    print(words,end=" ")

a about above after again against ain all am an and any are aren aren't as at be because been before being below between both but by can couldn couldn't d did didn didn't do does doesn doesn't doing don don't down during each few for from further had hadn hadn't has hasn hasn't have haven haven't having he he'd he'll her here hers herself he's him himself his how i i'd if i'll i'm in into is isn isn't it it'd it'll it's its itself i've just ll m ma me mightn mightn't more most mustn mustn't my myself needn needn't no nor not now o of off on once only or other our ours ourselves out over own re s same shan shan't she she'd she'll she's should shouldn shouldn't should've so some such t than that that'll the their theirs them themselves then there these they they'd they'll they're they've this those through to too under until up ve very was wasn wasn't we we'd we'll we're were weren weren't we've what when where which while who whom why will with won won't wouldn wouldn't y you you'd you'

### **Step 5: Getting rid of stopwords.**
For this we will create an empty list **mybag** and store in it all the useful words, i.e. words other than the stopwords.

**Note:** Some words in the stopword list may still be useful. Hence, we will create another list, **nonstopwords**, and keep the words it contains.

**E.g. not** is present in the list of stopwords, but without ‘not’ the meaning of a review can change completely:

**The gravy was not good. → (without not) The gravy was good.**
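The idea can be sketched without NLTK, assuming a tiny hand-written stopword set in place of the real English list. The names `stop_words`, `keep_words` and `remove_stopwords` are hypothetical.

```python
# Tiny hand-written stopword set standing in for NLTK's English list (assumption)
stop_words = {"the", "was", "not", "is", "a", "an", "this"}

# Stopwords we still want to keep because they change the meaning of a review
keep_words = {"not"}

def remove_stopwords(tokens):
    """Drop stopwords, except those listed in keep_words."""
    return [t for t in tokens if t not in stop_words or t in keep_words]

tokens = "the gravy was not good".split()
print(remove_stopwords(tokens))  # ['gravy', 'not', 'good']
```

Keeping "not" preserves the difference between "gravy not good" and "gravy good", which matters for sentiment.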


In [304]:
mybag=[]
nonstopwords=['not']
# Build the stopword set once, leaving out the words we want to keep (e.g. 'not')
stop_set=set(stopwords.words("english"))-set(nonstopwords)
for review in altered_text:
    # Keep every word that is not a stopword
    my_bag=[word for word in review if word not in stop_set]
    mybag.append(my_bag)
mybag[:5]


[['wow', 'loved', 'place'],
 ['crust', 'not', 'good'],
 ['not', 'tasty', 'texture', 'nasty'],
 ['stopped',
  'late',
  'may',
  'bank',
  'holiday',
  'rick',
  'steve',
  'recommendation',
  'loved'],
 ['selection', 'menu', 'great', 'prices']]

### **Step 6: Stemming**
Stemming is the process of reducing a word to its stem, the base form from which it is derived. The stem is not always a dictionary word (e.g. tasty → tasti).

**Example: loved, loving → love**

For this we use the **PorterStemmer** class from **nltk**.

We will first create an object of the **PorterStemmer** class and then use a for loop to convert each word in **mybag** to its stem. A stem is added to a review's list only if it is not already there, so each review ends up with a unique list of stems. These lists are collected in **my_stembag**.
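A minimal sketch of the stemmer on a few individual words from the first reviews is shown below, assuming `nltk` is installed. The Porter stemmer itself needs no downloaded corpus.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Words taken from the first few reviews
words = ["loved", "tasty", "texture", "prices", "recommendation"]

# stem() maps each word to its stem
stems = [ps.stem(w) for w in words]

print(stems)
```

Stems such as "tasti" and "textur" are not dictionary words. That is fine here, because the stems only serve as consistent feature names.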


In [305]:
from nltk.stem.porter import PorterStemmer
my_stembag=[]

In [306]:
ps=PorterStemmer()
for data in mybag:
    my_short_stem=[]
    for word in data:
        stemword=ps.stem(word)
        if stemword not in my_short_stem:
            my_short_stem.append(stemword)
    my_stembag.append(my_short_stem)
my_stembag[:10]

[['wow', 'love', 'place'],
 ['crust', 'not', 'good'],
 ['not', 'tasti', 'textur', 'nasti'],
 ['stop',
  'late',
  'may',
  'bank',
  'holiday',
  'rick',
  'steve',
  'recommend',
  'love'],
 ['select', 'menu', 'great', 'price'],
 ['get', 'angri', 'want', 'damn', 'pho'],
 ['honeslti', 'tast', 'fresh'],
 ['potato',
  'like',
  'rubber',
  'could',
  'tell',
  'made',
  'ahead',
  'time',
  'kept',
  'warmer'],
 ['fri', 'great'],
 ['great', 'touch']]

In [307]:
final_review=[]
for data in my_stembag:
    rev=" ".join(data)
    final_review.append(rev)
final_review[:5]

['wow love place',
 'crust not good',
 'not tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']

## **sklearn.feature_extraction.text.CountVectorizer**

It converts a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using **scipy.sparse.csr_matrix**.

In [308]:
# 1. wow great
# 2. nice man

# vocabulary = [wow, great, nice, man]

# column index of each word: [wow, great, nice, man] => [0, 1, 2, 3]

# "nice great wow great" => [1, 2, 1, 0]

# 1. wow great => [1, 1, 0, 0]
# 2. nice man => [0, 0, 1, 1]

In [309]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
x=cv.fit_transform(final_review).toarray()
x[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [310]:
len(cv.get_feature_names_out())

1566

In [311]:
cv.get_feature_names_out()[:155]

array(['absolut', 'absolutley', 'accid', 'accommod', 'accomod',
       'accordingli', 'account', 'ach', 'acknowledg', 'across', 'actual',
       'ad', 'afford', 'afternoon', 'ago', 'ahead', 'airlin', 'airport',
       'ala', 'albondiga', 'allergi', 'almond', 'almost', 'alon', 'also',
       'although', 'alway', 'amaz', 'ambianc', 'ambienc', 'amount',
       'ampl', 'andddd', 'angri', 'annoy', 'anoth', 'anticip', 'anymor',
       'anyon', 'anyth', 'anytim', 'anyway', 'apart', 'apolog', 'app',
       'appal', 'appar', 'appeal', 'appet', 'appetit', 'appl', 'approv',
       'area', 'arepa', 'aria', 'around', 'array', 'arriv', 'articl',
       'ask', 'assur', 'ate', 'atmospher', 'atroci', 'attach', 'attack',
       'attent', 'attitud', 'auju', 'authent', 'averag', 'avocado',
       'avoid', 'aw', 'away', 'awesom', 'awkward', 'awkwardli', 'ayc',
       'az', 'baba', 'babi', 'bachi', 'back', 'bacon', 'bad', 'bagel',
       'bakeri', 'baklava', 'ball', 'bamboo', 'banana', 'bank', 'bar',
      

In [312]:
len(x[1])

1566

In [313]:
sum(x[1]==1)

3

In [314]:
sum(x[3]==1),len(x[3])

(9, 1566)

In [315]:
y.values.ravel()

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,

In [316]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=47,stratify=y)

In [317]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
rfr=RandomForestClassifier(n_estimators=10 ,max_depth=15,criterion="entropy")
svc=SVC(kernel="rbf",C=15)

In [318]:
# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
model1=rfr.fit(x_train,y_train.values.ravel())


In [319]:
y1_pred=model1.predict(x_test)
y_test.values.ravel(),y1_pred

(array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
        0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,
        0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
        1, 0], dtype=int64),
 array([1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        1

In [320]:
print(f"training accuracy: {rfr.score(x_train,y_train)}")
print(f"testing accuracy: {rfr.score(x_test,y_test)}")

training accuracy: 0.8275
testing accuracy: 0.725


In [321]:
# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
model2=svc.fit(x_train,y_train.values.ravel())


In [322]:
y2_pred=model2.predict(x_test)
y2_pred,y_test.values.flatten()

(array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
        0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
        1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
        1, 0], dtype=int64),
 array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1

In [323]:
print(f"training accuracy: {svc.score(x_train,y_train)}")
print(f"testing accuracy: {svc.score(x_test,y_test)}")

training accuracy: 0.99625
testing accuracy: 0.765


In [324]:
from sklearn.linear_model import LogisticRegression
lgr=LogisticRegression()
# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
model3=lgr.fit(x_train,y_train.values.ravel())
y3_pred=lgr.predict(x_test)
y3_pred,y_test.values.ravel()


(array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
        0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1,
        1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
        1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
        1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1,
        1, 0], dtype=int64),
 array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1

In [325]:
print(f"training accuracy: {lgr.score(x_train,y_train)}")
print(f"testing accuracy: {lgr.score(x_test,y_test)}")

training accuracy: 0.97375
testing accuracy: 0.78


In [328]:
lgr.intercept_,lgr.coef_

(array([-0.16603819]),
 array([[ 0.2505989 ,  0.19463305,  0.56093744, ..., -0.16947458,
          0.06866962, -0.60703243]]))

In [332]:
result=lgr.predict_proba(x_test)
result

array([[0.02837817, 0.97162183],
       [0.08900271, 0.91099729],
       [0.49146323, 0.50853677],
       [0.17305239, 0.82694761],
       [0.255208  , 0.744792  ],
       [0.05704776, 0.94295224],
       [0.0733257 , 0.9266743 ],
       [0.0756365 , 0.9243635 ],
       [0.82893259, 0.17106741],
       [0.99195849, 0.00804151],
       [0.62055948, 0.37944052],
       [0.24169495, 0.75830505],
       [0.84874108, 0.15125892],
       [0.02924984, 0.97075016],
       [0.51236723, 0.48763277],
       [0.35080066, 0.64919934],
       [0.11416376, 0.88583624],
       [0.64064479, 0.35935521],
       [0.25355851, 0.74644149],
       [0.714687  , 0.285313  ],
       [0.9741721 , 0.0258279 ],
       [0.76582881, 0.23417119],
       [0.69926314, 0.30073686],
       [0.53198459, 0.46801541],
       [0.06596658, 0.93403342],
       [0.63751498, 0.36248502],
       [0.55824512, 0.44175488],
       [0.61094745, 0.38905255],
       [0.19135741, 0.80864259],
       [0.60998341, 0.39001659],
       [0.

In [333]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
model4=nb.fit(x_train,y_train.values.ravel())


In [334]:
y4_pred=model4.predict(x_test)
y4_pred,y_test.values.flatten()

(array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
        0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
        1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
        1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
        1, 1], dtype=int64),
 array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1

In [335]:
print(f"training accuracy: {nb.score(x_train,y_train)}")
print(f"testing accuracy: {nb.score(x_test,y_test)}")

training accuracy: 0.92375
testing accuracy: 0.665


In [336]:
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier(criterion="entropy",max_depth=15)
model5=dtc.fit(x_train,y_train)
y5_pred=dtc.predict(x_test)
y5_pred,y_test.values.ravel()

(array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        1, 0], dtype=int64),
 array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1

In [337]:
print(f"training accuracy: {dtc.score(x_train,y_train)}")
print(f"testing accuracy: {dtc.score(x_test,y_test)}")

training accuracy: 0.80875
testing accuracy: 0.71
