<a href="https://colab.research.google.com/github/DevSecOps-Stack/NLP/blob/main/nlp_practise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Index of Topics Covered:

1. General Feature Extraction
   - File loading
   - Word counts
   - Characters count
   - Average characters per word
   - Stop words count
   - Count #HashTags and @Mentions
   - Detection of numeric digits in texts
   - Upper case word counts

2. Preprocessing and Cleaning
   - Lower case conversion
   - Contraction to expansion
   - Emails removal and counts
   - URLs removal and counts
   - Removal of RT
   - Special characters removal
   - Removal of multiple spaces
   - HTML tags removal
   - Accented characters removal
   - Stop words removal
   - Conversion to base form of words
   - Common occurring words removal
   - Rare occurring words removal
   - Word cloud visualization
   - Spelling correction
   - Tokenization
   - Lemmatization
   - Detecting entities using Named Entity Recognition (NER)
   - Noun detection
   - Language detection
   - Sentence translation
   - Using inbuilt sentiment classifier

**1. General Feature Extraction**


**File Loading**

In [10]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

print(data.head())


                  text_column
0      This is a sample text.
1            Another example.
2  Python programming is fun.
3   OpenAI creates AI models.
4               Hello, world!
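
The cell above builds the DataFrame inline. With real data the text usually comes from a file, so here is a minimal sketch of that step. It assumes a CSV with a single `text_column` header; the `io.StringIO` buffer is only a stand-in for a file path, and its contents are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical content).
# Fields are quoted so the comma in "Hello, world!" is not read as a separator.
csv_text = io.StringIO(
    "text_column\n"
    '"This is a sample text."\n'
    '"Another example."\n'
    '"Hello, world!"\n'
)

# pd.read_csv accepts a file path or any file-like object.
loaded = pd.read_csv(csv_text)

print(loaded.shape)
print(loaded.head())
```

Replacing `csv_text` with a path string should work the same way, provided the file uses the same header.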


**Word Counts**

In [17]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

def word_count(text):
    words = text.split()
    return len(words)

# Applying the word_count function to the 'text_column' and creating a new 'word_count' column
data['word_count'] = data['text_column'].apply(word_count)

# Printing the first few rows of the DataFrame with the new 'word_count' column
print(data[['text_column', 'word_count']].head())


                  text_column  word_count
0      This is a sample text.           5
1            Another example.           2
2  Python programming is fun.           4
3   OpenAI creates AI models.           4
4               Hello, world!           2
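
The same count can also be computed without a Python-level function, using the pandas string accessor. This is a sketch on a shortened copy of the sample data, not a replacement for the cell above.

```python
import pandas as pd

data = pd.DataFrame({
    "text_column": ["This is a sample text.", "Another example.", "Hello, world!"]
})

# .str.split() splits each row on whitespace; .str.len() counts the tokens per row.
data["word_count"] = data["text_column"].str.split().str.len()

print(data)
```

On large columns this form is usually shorter to write than `.apply`; both treat punctuation attached to a word as part of that word.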


**Characters Count**

In [18]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

def char_count(text):
    return len(text)

# Applying the char_count function to the 'text_column' and creating a new 'char_count' column
data['char_count'] = data['text_column'].apply(char_count)

# Printing the first few rows of the DataFrame with the new 'char_count' column
print(data[['text_column', 'char_count']].head())


                  text_column  char_count
0      This is a sample text.          22
1            Another example.          16
2  Python programming is fun.          26
3   OpenAI creates AI models.          25
4               Hello, world!          13
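
Note that `len(text)` counts spaces as characters. If only non-whitespace characters matter, one possible variant is sketched below; the function name `char_count_no_spaces` is hypothetical.

```python
def char_count_no_spaces(text):
    # Split on any whitespace and rejoin, which drops spaces, tabs, and newlines.
    return len("".join(text.split()))

print(char_count_no_spaces("This is a sample text."))
print(char_count_no_spaces("Hello, world!"))
```

For the first sample sentence this gives 18 rather than 22, because the four spaces are no longer counted.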


**Average Characters per Word**

In [19]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

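def avg_word_length(text):
    words = text.split()
    # Guard against empty or whitespace-only text, which would otherwise raise ZeroDivisionError
    return sum(len(word) for word in words) / len(words) if words else 0.0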

# Applying the avg_word_length function to the 'text_column' and creating a new 'avg_word_length' column
data['avg_word_length'] = data['text_column'].apply(avg_word_length)

# Printing the first few rows of the DataFrame with the new 'avg_word_length' column
print(data[['text_column', 'avg_word_length']].head())


                  text_column  avg_word_length
0      This is a sample text.             3.60
1            Another example.             7.50
2  Python programming is fun.             5.75
3   OpenAI creates AI models.             5.50
4               Hello, world!             6.00
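
The averages above include trailing punctuation, since `"text."` is counted as five characters. A sketch of a variant that counts only word characters follows; the name `avg_word_length_alnum` and the choice of `\w+` as the word pattern are assumptions for illustration.

```python
import re

def avg_word_length_alnum(text):
    # \w+ keeps letters, digits, and underscores, so "text." contributes 4 characters, not 5.
    words = re.findall(r"\w+", text)
    # Return 0.0 for text with no word characters instead of dividing by zero.
    return sum(len(w) for w in words) / len(words) if words else 0.0

print(avg_word_length_alnum("This is a sample text."))
print(avg_word_length_alnum("Hello, world!"))
```

With this pattern the first sentence averages 3.4 instead of 3.6, and "Hello, world!" averages 5.0 instead of 6.0.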


**Stop Words Count**

In [20]:
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

stop_words = set(stopwords.words('english'))

def stop_words_count(text):
    words = text.split()
    return len([word for word in words if word.lower() in stop_words])

# Applying the stop_words_count function to the 'text_column' and creating a new 'stop_words_count' column
data['stop_words_count'] = data['text_column'].apply(stop_words_count)

# Printing the first few rows of the DataFrame with the new 'stop_words_count' column
print(data[['text_column', 'stop_words_count']].head())


                  text_column  stop_words_count
0      This is a sample text.                 3
1            Another example.                 0
2  Python programming is fun.                 1
3   OpenAI creates AI models.                 0
4               Hello, world!                 0


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
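
Because the function splits on whitespace only, a stop word followed by punctuation (for example `"is."`) is not matched. The sketch below strips punctuation from each token first. To stay self-contained it uses a small hand-written stop-word set, which is an assumption; in the notebook the NLTK `stop_words` set would be passed in instead.

```python
import string

# Small illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"this", "is", "a", "it", "the"}

def stop_words_count_clean(text, stop_words=STOP_WORDS):
    # Strip leading and trailing punctuation from each token before the lookup.
    tokens = (word.strip(string.punctuation).lower() for word in text.split())
    return sum(token in stop_words for token in tokens)

print(stop_words_count_clean("Fun it is."))
print(stop_words_count_clean("This is a sample text."))
```

For "Fun it is." the whitespace-only version would find one stop word ("it"), while this variant finds two.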


**Count #HashTags and @Mentions**

In [27]:
import pandas as pd
import re

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text. #example", "Another example. @user", "Python programming is fun. #Python", "OpenAI creates AI models.", "Hello, world! @openai"]
})

def hashtag_count(text):
    hashtags = re.findall(r'#\w+', text)
    return len(hashtags)

def mention_count(text):
    mentions = re.findall(r'@\w+', text)
    return len(mentions)

# Applying the hashtag_count and mention_count functions to the 'text_column' and creating new columns
data['hashtag_count'] = data['text_column'].apply(hashtag_count)
data['mention_count'] = data['text_column'].apply(mention_count)

# Printing the first few rows of the DataFrame with the new columns
print(data[['text_column', 'hashtag_count', 'mention_count']].head())


                          text_column  hashtag_count  mention_count
0     This is a sample text. #example              1              0
1              Another example. @user              0              1
2  Python programming is fun. #Python              1              0
3           OpenAI creates AI models.              0              0
4               Hello, world! @openai              0              1
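
One caveat with the pattern `@\w+` is that it also matches the domain part of an email address such as `user@example.com`. A possible refinement, sketched here with hypothetical names `HASHTAG_RE`, `MENTION_RE`, and `extract_tags`, uses a negative lookbehind and returns the matched strings rather than only their count.

```python
import re

# (?<!\w) requires that the symbol is not preceded by a word character,
# which skips the "@example" inside "user@example.com".
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def extract_tags(text):
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }

sample = "Hello @openai, mail user@example.com about #NLP"
print(extract_tags(sample))
```

The counts used in the cell above can then be taken as `len()` of each list.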


**Detection of Numeric Digits**

In [33]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text with number 123.", "Another example.", "Python programming is fun.", "OpenAI creates AI models.", "Hello, world!"]
})

def has_digits(text):
    return any(char.isdigit() for char in text)

# Applying the has_digits function to the 'text_column' and creating a new 'has_digits' column
data['has_digits'] = data['text_column'].apply(has_digits)

# Printing the first few rows of the DataFrame with the new 'has_digits' column
print(data[['text_column', 'has_digits']].head())


                              text_column  has_digits
0  This is a sample text with number 123.        True
1                        Another example.       False
2              Python programming is fun.       False
3               OpenAI creates AI models.       False
4                           Hello, world!       False
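
`has_digits` only reports whether any digit is present. When the number of numeric tokens or their values are needed as features, a regular expression can extract them. This is a sketch; `digit_features` is a hypothetical helper, and it handles only unsigned integers, not decimals or negative numbers.

```python
import re

def digit_features(text):
    # \d+ matches each maximal run of digits, so "2024" is one number, not four.
    numbers = re.findall(r"\d+", text)
    return {"numeric_count": len(numbers), "numbers": [int(n) for n in numbers]}

print(digit_features("Order 66 shipped 3 items in 2024."))
print(digit_features("Another example."))
```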


**Upper Case Word Counts**

In [36]:
import pandas as pd

# Sample DataFrame for demonstration
data = pd.DataFrame({
    'text_column': ["This is a sample text.", "Another EXAMPLE.", "Python programming is FUN.", "OpenAI creates AI MODELS.", "HELLO, world!"]
})

def upper_case_count(text):
    words = text.split()
    return len([word for word in words if word.isupper()])

# Applying the upper_case_count function to the 'text_column' and creating a new 'upper_case_count' column
data['upper_case_count'] = data['text_column'].apply(upper_case_count)

# Printing the first few rows of the DataFrame with the new 'upper_case_count' column
print(data[['text_column', 'upper_case_count']].head())


                  text_column  upper_case_count
0      This is a sample text.                 0
1            Another EXAMPLE.                 1
2  Python programming is FUN.                 1
3   OpenAI creates AI MODELS.                 2
4               HELLO, world!                 1
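
`str.isupper()` is also true for single-letter words such as "I" or "A", which may inflate the count when upper case is meant to signal emphasis. A sketch of a stricter variant follows; the name `upper_case_count_strict` and the `min_length` parameter are assumptions.

```python
import string

def upper_case_count_strict(text, min_length=2):
    count = 0
    for word in text.split():
        # Remove surrounding punctuation so "HELLO," is measured as "HELLO".
        token = word.strip(string.punctuation)
        # Ignore tokens shorter than min_length, such as "I" or "A".
        if len(token) >= min_length and token.isupper():
            count += 1
    return count

print(upper_case_count_strict("I said HELLO to A friend."))
print(upper_case_count_strict("OpenAI creates AI MODELS."))
```

For the first sentence the original function would count three words ("I", "HELLO", "A"), while this variant counts one.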
