# Corpus: a set of sentences

### NLP Text Preprocessing Workflow

1. **Tokenization**:
   - **Sentence Tokenization**: Splitting text into sentences.
     - Example: `"Hello world. How are you?"` → `["Hello world.", "How are you?"]`
   - **Word Tokenization**: Splitting sentences into words.
     - Example: `"Hello world."` → `["Hello", "world"]`

2. **Stemming**:
   - Reducing words to a root form by stripping suffixes with heuristic rules; the result need not be a dictionary word.
     - Example: `"running"`, `"runs"` → `"run"`

3. **Lemmatization**:
   - Converting words to their base or dictionary form, considering context.
     - Example: `"running"`, `"ran"` → `"run"`

4. **Part-of-Speech (POS) Tagging**:
   - Assigning parts of speech (e.g., noun, verb) to each word.
     - Example: `["run/Noun", "quick/Adjective"]`

5. **Stopwords Removal**:
   - Removing common words that do not contribute much meaning to the text.
     - Example: `"I am going to the store."` → `"going store."`

6. **Dependency Parsing**:
   - Analyzing grammatical relationships between words in a sentence.
     - Example: `"She enjoys reading books"` → Identifies subject-verb-object relationships.

7. **Numerical Conversion**:
   - **One-Hot Encoding**: Converting categorical words into a binary vector representation.
     - Example: `["cat", "dog", "fish"]` → `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`

Each step plays a crucial role in cleaning and preparing text data for NLP tasks.
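
A few of the steps listed above (tokenization, stopword removal, one-hot encoding) can be sketched in plain Python. This is only a minimal illustration with the standard library: the regular expressions, the helper names, and the tiny `STOPWORDS` set are assumptions made for this example, not the rules that libraries such as NLTK or spaCy apply.

```python
import re

# Hypothetical, deliberately tiny stopword list (real libraries ship much larger ones).
STOPWORDS = {"i", "am", "to", "the", "a", "an", "is", "are"}


def sentence_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return runs of letters, digits and apostrophes, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def remove_stopwords(tokens):
    """Keep tokens whose lowercase form is not in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def one_hot(vocabulary):
    """Map each word to a vector with a single 1 at the word's index."""
    size = len(vocabulary)
    return {
        word: [1 if i == idx else 0 for i in range(size)]
        for idx, word in enumerate(vocabulary)
    }


sentences = sentence_tokenize("Hello world. How are you?")
words = word_tokenize("Hello world.")
kept = remove_stopwords(word_tokenize("I am going to the store."))
vectors = one_hot(["cat", "dog", "fish"])

print(sentences)  # ['Hello world.', 'How are you?']
print(words)      # ['Hello', 'world']
print(kept)       # ['going', 'store']
print(vectors)    # {'cat': [1, 0, 0], 'dog': [0, 1, 0], 'fish': [0, 0, 1]}
```

Stemming, lemmatization, POS tagging and dependency parsing depend on language resources (rule sets, dictionaries, trained models), so they are left to a dedicated library rather than sketched here.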

In [77]:
import pandas as pd
import numpy as np

In [79]:
df=pd.read_csv("spam.csv")
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [81]:
df.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

In [83]:
df['spam']=df['Category'].apply(lambda x:1 if x=='spam' else 0)

In [85]:
df.shape

(5572, 3)

In [87]:
df.head()

|   | Category | Message | spam |
|---|----------|---------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


## Train Test Split

In [90]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.Message,df.spam,test_size=0.2)

In [92]:
X_train.shape

(4457,)

In [94]:
X_test.shape

(1115,)
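
The two shapes above follow from applying `test_size=0.2` to the 5572 rows. The arithmetic below reproduces them, assuming (as scikit-learn does when only `test_size` is given) that the test count is rounded up and the remainder goes to training.

```python
import math

n_samples = 5572   # rows in df, taken from df.shape above
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test share is rounded up
n_train = n_samples - n_test               # the rest is used for training

print(n_train, n_test)  # 4457 1115
```

Because the call above sets no `random_state`, each run produces a different split. Passing `random_state` and `stratify=df.spam` to `train_test_split` would make the split reproducible and keep the ham/spam ratio the same in both parts.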

## Create bag of words representation using CountVectorizer

In [97]:
from sklearn.feature_extraction.text import CountVectorizer

v=CountVectorizer()

X_train_cv=v.fit_transform(X_train.values)
X_train_cv

<4457x7665 sparse matrix of type '<class 'numpy.int64'>'
	with 58761 stored elements in Compressed Sparse Row format>

In [99]:
X_train_cv.toarray()[:2][0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
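
To make the matrix above less opaque, here is a rough standard-library sketch of what a bag-of-words transform does. It mirrors `CountVectorizer`'s documented defaults (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary) but returns dense lists instead of a sparse matrix; the function names and the two sample messages are made up for illustration.

```python
import re
from collections import Counter

# Token rule modelled on CountVectorizer's default: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract its tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Assign a column index to every distinct token, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a dense list of token counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens unseen during fitting are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


docs = ["Free entry to win", "Win a free free prize"]  # made-up messages
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)

print(vocab)   # {'entry': 0, 'free': 1, 'prize': 2, 'to': 3, 'win': 4}
print(matrix)  # [[1, 1, 0, 1, 1], [0, 2, 1, 0, 1]]
```

The vocabulary is learned from the training messages only, which is why the test set is later passed through `transform` and not `fit_transform`.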

## Train the Naive Bayes model

In [102]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(X_train_cv,y_train)

In [104]:
X_test_cv=v.transform(X_test)
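
For intuition about what `MultinomialNB` fits, the following is a simplified sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, written with the standard library only. It is not scikit-learn's implementation; `train_nb`, `predict_nb` and the four training messages are hypothetical names and data chosen for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({tok for doc in documents for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: tokens seen in class c plus alpha per vocabulary word.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Made-up, already tokenized training messages: 1 = spam, 0 = ham.
train_docs = [
    ["free", "prize", "win"],
    ["win", "free", "entry"],
    ["see", "you", "tomorrow"],
    ["lunch", "tomorrow"],
]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
spam_guess = predict_nb(["free", "entry"], log_prior, log_likelihood)
ham_guess = predict_nb(["see", "you", "lunch"], log_prior, log_likelihood)

print(spam_guess, ham_guess)  # 1 0
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.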

## Evaluate Performance

In [114]:
from sklearn.metrics import classification_report
y_pred=model.predict(X_test_cv)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       962
           1       0.99      0.93      0.96       153

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
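
In the report above, the recall of 0.93 for class 1 means that roughly 7% of the spam messages in the test set were labelled ham. The sketch below shows how precision, recall and F1 are computed from true and predicted labels; the helper name and the ten toy labels are invented for illustration and are not taken from the spam dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0  # how many flagged items were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many positives were found
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy labels: 4 spam (1) and 6 ham (0); one spam is missed, one ham is flagged.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

With imbalanced classes such as ham versus spam, these per-class numbers are more informative than overall accuracy, since a model that always predicts the majority class would still score a high accuracy.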



In [None]:
emails=[
    'Hey mohan ,can we get together to watch football game tomorrow?',
    'upto 20% discount on parking ,exclusive offer just for you ,dont miss the end'
]

emails_count=v.transform(emails)
model.predict(emails_count)