<a href="https://colab.research.google.com/github/GreeshmaHarids/Greeshma_Meta_Scifor_Technology/blob/main/Machine_Learning/Week_Assessments/Test_5-NLP_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP




### 1. Write a program for text processing

Steps in text processing:


*   Convert all the text to lowercase
*   Split the text into individual words/tokens
*   Remove special characters
*   Remove stop words and punctuation
*   Apply stemming or lemmatization



In [29]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')




[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [30]:
text = "Hello!... Everyone today is November 26th and its 3pm and I am going to read."

# Text-processing function that prints the result of each step
def text_process(text):

  # Convert to lowercase
  text = text.lower()
  print("After Lowercase:\n", text)

  # Tokenization
  tokens = word_tokenize(text)
  print("\nTokens:\n", tokens)

  # Remove special characters: keep only purely alphabetic tokens
  # (this also drops mixed tokens such as '26th' and '3pm')
  tokens = [word for word in tokens if word.isalpha()]
  print("\nAfter Removing Special Characters:\n", tokens)

  # Remove stop words
  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words]
  print("\nAfter Removing Stop Words:\n", tokens)

  # Stemming
  stemmer = PorterStemmer()
  tokens = [stemmer.stem(word) for word in tokens]
  print("\nAfter Stemming:\n", tokens)

  # Lemmatization (applied to the stems from the previous step, so the
  # tokens do not change here; normally only one of the two is used)
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
  print("\nAfter Lemmatization:\n", tokens)


text_process(text)


After Lowercase:
 hello!... everyone today is november 26th and its 3pm and i am going to read.

Tokens:
 ['hello', '!', '...', 'everyone', 'today', 'is', 'november', '26th', 'and', 'its', '3pm', 'and', 'i', 'am', 'going', 'to', 'read', '.']

After Removing Special Characters:
 ['hello', 'everyone', 'today', 'is', 'november', 'and', 'its', 'and', 'i', 'am', 'going', 'to', 'read']

After Removing Stop Words:
 ['hello', 'everyone', 'today', 'november', 'going', 'read']

After Stemming:
 ['hello', 'everyon', 'today', 'novemb', 'go', 'read']

After Lemmatization:
 ['hello', 'everyon', 'today', 'novemb', 'go', 'read']
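
The `isalpha()` filter above also discards mixed tokens such as `26th` and `3pm`, which here carry the date and time. A minimal sketch of one alternative, assuming alphanumeric tokens should be kept: filter with `str.isalnum()` instead. The token list is copied from the tokenizer output above, and the variable names are illustrative only.

```python
# Tokens copied from the word_tokenize output shown above.
tokens = ['hello', '!', '...', 'everyone', 'today', 'is', 'november', '26th',
          'and', 'its', '3pm', 'and', 'i', 'am', 'going', 'to', 'read', '.']

# isalpha() keeps only purely alphabetic tokens, so '26th' and '3pm' are lost.
alpha_only = [tok for tok in tokens if tok.isalpha()]

# isalnum() still drops punctuation but keeps letter/digit mixes.
alnum_kept = [tok for tok in tokens if tok.isalnum()]

recovered = [tok for tok in alnum_kept if tok not in alpha_only]
print("Dropped by isalpha() but kept by isalnum():", recovered)
```

Whether such tokens are useful depends on the task, so this is a choice rather than a fix.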


### 2. Write a program to implement NLP using spaCy


In [31]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')



In [32]:
text=nlp(
    "Hello!... Everyone today is November 26th and its 3pm and I am going to read."
)

print(type(text))

[token.text for token in text]

<class 'spacy.tokens.doc.Doc'>


['Hello',
 '!',
 '...',
 'Everyone',
 'today',
 'is',
 'November',
 '26th',
 'and',
 'its',
 '3',
 'pm',
 'and',
 'I',
 'am',
 'going',
 'to',
 'read',
 '.']

In [33]:
# Sentence detection: print the first five tokens of each sentence
sent = """Hello... all how are you.
Today is Friday.Happy to be here"""

sent_doc = nlp(sent)
sentences = list(sent_doc.sents)

for sentence in sentences:
  print(f"'{sentence[:5]}'")

'Hello... all how are'
'Today is Friday.'
'Happy to be here'


In [34]:
# Tokenization: extract tokens
print("Tokens:")
for token in sent_doc:
    print(token.text)


Tokens:
Hello
...
all
how
are
you
.


Today
is
Friday
.
Happy
to
be
here


In [35]:
# Remove special characters (non-alphabetic)
tokens_clean = [token.text for token in sent_doc if token.is_alpha]

print("Cleaned Tokens (No special characters): ", tokens_clean)

Cleaned Tokens (No special characters):  ['Hello', 'all', 'how', 'are', 'you', 'Today', 'is', 'Friday', 'Happy', 'to', 'be', 'here']


In [36]:
# Remove stop words and punctuation
# (the whitespace token '\n' is neither, so it stays in the result)
tokens_filtered = [token.text for token in sent_doc if not token.is_stop and not token.is_punct]

print("Filtered Tokens (No stop words & punctuation): ", tokens_filtered)

Filtered Tokens (No stop words & punctuation):  ['Hello', '\n', 'Today', 'Friday', 'Happy']


In [37]:
# Lemmatization: get the base form of each remaining word
tokens_lemmatized = [token.lemma_ for token in sent_doc if not token.is_stop and not token.is_punct]

print("Lemmatized Tokens: ", tokens_lemmatized)

Lemmatized Tokens:  ['hello', '\n', 'today', 'Friday', 'happy']
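
Both filtered lists above still contain `'\n'`, because spaCy treats the line break as a whitespace token, which is neither a stop word nor punctuation. A minimal sketch of one way to drop it, using the `token.is_space` attribute. It assumes that `spacy.blank("en")` (tokenizer and lexical attributes only) is enough for this filter, so no trained model is loaded; lemmas would still need the full `en_core_web_sm` pipeline used above. The names `nlp_blank` and `doc_blank` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline is sufficient here, since is_stop,
# is_punct and is_space are lexical attributes available after tokenization.
nlp_blank = spacy.blank("en")

doc_blank = nlp_blank("""Hello... all how are you.
Today is Friday.Happy to be here""")

# Same filter as above, with whitespace tokens excluded as well.
tokens_no_space = [
    token.text
    for token in doc_blank
    if not token.is_stop and not token.is_punct and not token.is_space
]

print("Filtered Tokens (no whitespace):", tokens_no_space)
```

The same extra condition can be added to the lemmatization list comprehension.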


In [38]:

# Part-of-Speech Tagging
print("\nPart-of-Speech Tags:")
for token in sent_doc:
    print(f"{token.text}: {token.pos_}")



Part-of-Speech Tags:
Hello: INTJ
...: PUNCT
all: PRON
how: SCONJ
are: AUX
you: PRON
.: PUNCT

: SPACE
Today: NOUN
is: AUX
Friday: PROPN
.: PUNCT
Happy: ADJ
to: PART
be: AUX
here: ADV


In [39]:

# Named Entity Recognition (NER)
print("\nNamed Entities:")
for ent in sent_doc.ents:
    print(f"{ent.text} ({ent.label_})")



Named Entities:
Today (DATE)
Friday (DATE)


## Statistics

### 1.  Difference between descriptive and inferential statistics.

1. Descriptive statistics summarize or describe data, whereas inferential statistics draw conclusions about a population based on a sample.

2. Types of descriptive statistics:

    *   Measures of central tendency: mean, median, and mode.
    *   Measures of variability: range, variance, and standard deviation.
    *   Measures of frequency: frequency tables and contingency tables.
    *   Graphical representations such as pie charts and bar charts.

    Inferential statistics include tests of significance and related methods, such as hypothesis testing, regression analysis, ANOVA, and chi-square tests.

3. Descriptive statistics describe all of the data at hand, whereas inferential statistics use a sample to reason about the larger population it was drawn from.

4. Descriptive statistics need no assumptions about the underlying population; inferential statistics rely on some assumptions about it.

5. Descriptive statistics are mainly used to organize, analyze, and present data in a meaningful way, whereas inferential statistics are used to compare, test, and predict.

6. Descriptive statistics do not depend on probability, but inferential statistics depend strongly on probability concepts.

7. Descriptive statistics are concerned with present or historical data, whereas inferential statistics are concerned with future or unseen data.

8. Descriptive statistics are limited to the data at hand and do not allow generalization beyond it; inferential statistics allow generalization to a larger population based on sample data.

9. Descriptive statistics are less complex, since they simply describe the dataset; inferential statistics are more complex, as they involve testing hypotheses, estimating parameters, and making predictions with a margin of error.

10. Descriptive statistics give exact values for the dataset being described, with no margin of error; inferential statistics give estimates and predictions.
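
The contrast can be sketched with Python's standard `statistics` module. The scores below are made-up illustrative data, and the 95% interval uses a normal (z) approximation as a simplifying assumption; for a sample this small, a t-based interval would be the more usual choice.

```python
import math
import statistics

# Hypothetical sample of exam scores (illustrative data only).
sample = [62, 70, 70, 75, 81, 85, 90]

# Descriptive statistics: exact summaries of the data at hand.
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)
data_range = max(sample) - min(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

print(f"mean={mean:.2f} median={median} mode={mode} "
      f"range={data_range} stdev={stdev:.2f}")

# Inferential statistics: estimate the population mean from the sample.
# Assumption: normal approximation for a 95% confidence interval.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * stdev / math.sqrt(len(sample))
ci_low, ci_high = mean - margin, mean + margin

print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first group of values describes only these seven scores; the interval is a statement about the unseen population and carries a margin of error.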


