
## **TEXT EXTRACTION AND CLASSIFICATION OF SOCIAL MEDIA CONTENT FROM IMAGES:** <br>
<hr>
 












## **Introduction**<br>
Classification is a powerful technique for handling social media content. By analyzing large volumes of text data, classifiers can automatically categorize social media content into different topics or sentiment categories, helping social media platforms and businesses better understand their users' preferences and opinions. This enables more targeted marketing campaigns, personalized recommendations, and more effective content moderation. Classification models can also be trained to detect and prevent harmful content, such as hate speech or cyberbullying, helping to make social media a safer and more inclusive space for all users.

## **Objective** <br>
The objective of text extraction and classification of social media content from images is to automatically extract text from images, classify it into relevant categories, and derive insights to inform social media strategies and content moderation.
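
The objective above chains three steps: extract text, clean it, classify it. A minimal sketch of that flow is shown here. The function name `classify_image_text` and the stub callables are hypothetical and only illustrate how the pieces could be wired together; in this notebook the real steps are EasyOCR, pyspellchecker, and the two scikit-learn pipelines.

```python
def classify_image_text(image_path, extract_text, correct_words, classifiers):
    """Run extraction, spelling correction and classification for one image.

    extract_text:  callable(image_path) -> list of extracted strings
    correct_words: callable(list of strings) -> list of corrected words
    classifiers:   dict mapping a name to a callable(list of documents) -> predictions
    """
    lines = extract_text(image_path)
    words = correct_words(lines)
    text = " ".join(words)
    # Each classifier receives a list holding ONE document and returns one label.
    labels = {name: predict([text])[0] for name, predict in classifiers.items()}
    return text, labels


# Hypothetical stand-ins so the sketch runs without OCR or trained models.
def fake_ocr(path):
    return ["SAW YOU HAD", "NEGATIVE NUMBER"]

def fake_correct(lines):
    return [word.lower() for line in lines for word in line.split()]

def fake_negative(docs):
    return [1 if "negative" in doc else 0 for doc in docs]

def fake_offensive(docs):
    return [0 for _ in docs]

text, labels = classify_image_text(
    "Meme13.jpeg",
    fake_ocr,
    fake_correct,
    {"negative": fake_negative, "offensive": fake_offensive},
)
print(text)
print(labels)
```

Passing the steps in as callables keeps the sketch independent of any particular OCR engine or model.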

**Installation of Required Libraries:**

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
!python -m spacy download en_core_web_trf

**Importing the Required Libraries**

In [None]:
from IPython.display import SVG, display
import spacy
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import numpy as np
import io
%matplotlib inline

In [None]:
# Importing Required Libraries for Model Training and Testing: 
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

**Trial Run on Libraries**

In [None]:
# Trial run for Spacy Library: 
nlp = spacy.load("en_core_web_sm")
doc = nlp("My Name is Inigo Montaya!!")
for token in doc:
  print(token.text,token.pos_,token.dep_)

My PRON poss
Name NOUN nsubj
is AUX ROOT
Inigo PROPN compound
Montaya PROPN attr
! PUNCT punct
! PUNCT punct


In [None]:
# Displaying Stopwords default in Spacy: 
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words
print(stopwords)

{'had', 'few', 'five', 'say', 'than', 'everywhere', 'thence', 'nothing', 'for', 'well', 'namely', 'hereby', 'back', 'seems', 'whither', 'will', 'at', 'front', 'whether', 'almost', 'this', 'on', 'first', 'mostly', 'make', 'bottom', 'against', 'call', 'serious', 'may', 'those', 'to', 'down', 'of', 'toward', 'most', 'from', 'either', 'other', 'over', 'the', 'what', 'give', 'through', 'since', 'somehow', 'whom', 'can', 'twenty', 'though', 'am', 'six', 'did', 'together', 'about', 'many', 'yours', 'something', 'whose', 'eight', 'get', 'is', 'latterly', 'ca', 'third', 'some', 'except', 'cannot', 'so', 'herself', 'done', 'above', 'when', 'not', 'thru', 'everything', 'your', 'does', 'why', 'both', 'used', 'elsewhere', 'enough', 'still', "'re", 'others', 'whenever', 'beforehand', 'are', 'them', 'along', 'therein', 'amount', 'sometimes', 'really', 'please', 'each', 'then', 'could', 'hereupon', 'i', "'ll", 'anywhere', 'her', 'ours', 'that', 'ourselves', 'whereby', 'further', 'whereafter', 'quite',

**Creating and Compiling Datasets**

In [None]:
# Importing Comments From Twitter Dataset: 
df = pd.read_csv('Twitter_Data.csv')
df2 = df.copy()

In [None]:
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
dummy_df = df2.loc[:100].copy()
# The membership test must use each `word`; testing the whole string `x` would leave the text unchanged.
dummy_df["clean_text_new"] = dummy_df["clean_text"].apply(lambda x: " ".join([word for word in x.split() if word not in stopwords]))
dummy_df

Unnamed: 0,clean_text,category,clean_text_new
0,when modi promised “minimum government maxim...,-1.0,when modi promised “minimum government maxim...
1,talk all the nonsense and continue all the dra...,0.0,talk all the nonsense and continue all the dra...
2,what did just say vote for modi welcome bjp t...,1.0,what did just say vote for modi welcome bjp to...
3,asking his supporters prefix chowkidar their n...,1.0,asking his supporters prefix chowkidar their n...
4,answer who among these the most powerful world...,1.0,answer who among these the most powerful world...
...,...,...,...
96,sabbash mera vote for peppermit abvp,0.0,sabbash mera vote for peppermit abvp
97,yogi adityanath hold 100 rallies seek votes fo...,0.0,yogi adityanath hold 100 rallies seek votes fo...
98,from the very beginningmodi doing wada faramos...,-1.0,from the very beginningmodi doing wada faramos...
99,modi politics hate modiji loves india modiji w...,1.0,modi politics hate modiji loves india modiji w...
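
As a small illustration of the word-level stopword filter used above, the sketch below applies the same idea to one sample comment. The tiny `STOPWORDS` set is a hypothetical stand-in for spaCy's `nlp.Defaults.stop_words`, chosen so the example runs without spaCy.

```python
# Hypothetical mini stopword set; the notebook itself uses spaCy's default set.
STOPWORDS = {"all", "the", "and", "for", "what", "did", "just"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return text with every whitespace-separated word found in stopwords removed."""
    return " ".join(word for word in text.split() if word not in stopwords)


sample = "talk all the nonsense and continue"
cleaned = remove_stopwords(sample)
print(cleaned)
```

The check is made per word. Comparing the full comment string against the set would never match, so nothing would be removed.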


In [None]:
# Importing dataset containing Positive and Negative Words: 
df3 = pd.read_excel("Word_List1.xlsx")
df3

Unnamed: 0.1,Unnamed: 0,Negative Sense Word List,Positive Sense Word List
0,0,,
1,1,abnormal,able
2,2,abolish,abundance
3,3,abominable,accelerate
4,4,abominably,accept
...,...,...,...
4716,4716,zenana,
4717,4717,zephyr,
4718,4718,zero,
4719,4719,zol,


In [None]:
df3.columns

Index(['Unnamed: 0', 'Negative Sense Word List', 'Positive Sense Word List'], dtype='object')

In [None]:
# Splitting Datasets for Labelling: 
# .copy() gives independent frames, so the new columns are not written to a view of df3.
df_3 = df3[['Positive Sense Word List']].copy()
df_4 = df3[['Negative Sense Word List']].copy()
# Category: 0 = positive word, 1 = negative word. Offensive defaults to 0.
df_3["Category"] = 0
df_3["Offensive"] = 0
df_4["Category"] = 1
df_4["Offensive"] = 0

In [None]:
# Dropping All Null Values from WORDS LIST: 
df_4.dropna(inplace=True)
df_3.dropna(inplace=True)

In [None]:
# Checking for NA Values in Positive Dataset: 
df_3.isnull().sum()

Positive Sense Word List    0
Category                    0
Offensive                   0
dtype: int64

In [None]:
# Checking for NA Values in Negative Dataset: 
df_4.isnull().sum()

Negative Sense Word List    0
Category                    0
Offensive                   0
dtype: int64

In [None]:
# Creating Common Column for Merging in Model Training: 
df_3["Word_List"] = df_3["Positive Sense Word List"]
df_4["Word_List"] = df_4["Negative Sense Word List"]
df_3.drop(["Positive Sense Word List"],axis =1,inplace = True)
df_4.drop(["Negative Sense Word List"],axis=1,inplace = True) 
df3 = pd.concat([df_3,df_4],ignore_index = True)
df3

Unnamed: 0,Category,Offensive,Word_List
0,0,0,able
1,0,0,abundance
2,0,0,accelerate
3,0,0,accept
4,0,0,acclaim
...,...,...,...
9411,1,0,zenana
9412,1,0,zephyr
9413,1,0,zero
9414,1,0,zol


In [None]:
# Creating a List of Bad_Words (first column of the CSV) for Cross-Verification: 
bad_words = pd.read_csv("bad-words.csv").iloc[:, 0].tolist()

In [None]:
# Checking for Abusive words in Vocabulary Dataset for Labelling :
df3["Text_Comments"] = df3["Word_List"]
# Set Offensive to 1 for every word found in bad_words (an assignment, not an `==` comparison).
df3["Offensive"] = df3["Text_Comments"].isin(bad_words).astype(int)

# Rearranging the columns in Vocab Dataset: 
df3.drop("Word_List",axis = 1,inplace=True)
df3.reindex(columns=['Text_Comments',"Category","Offensive"])

Unnamed: 0,Text_Comments,Category,Offensive
0,able,0,0
1,abundance,0,0
2,accelerate,0,0
3,accept,0,0
4,acclaim,0,0
...,...,...,...
9411,zenana,1,0
9412,zephyr,1,0
9413,zero,1,0
9414,zol,1,0


In [None]:
df

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maxim...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0
...,...,...
162975,why these 456 crores paid neerav modi not reco...,-1.0
162976,dear rss terrorist payal gawar what about modi...,-1.0
162977,did you cover her interaction forum where she ...,0.0
162978,there big project came into india modi dream p...,0.0


In [None]:
df.columns

Index(['clean_text', 'category'], dtype='object')

In [None]:
# Checking for Abusive Language in Comments Dataset:
# A comment is flagged (1) when any of its words appears in bad_words.
bad_words_set = set(bad_words)
df["Offensive"] = [
    int(any(word in bad_words_set for word in str(text).split()))
    for text in df["clean_text"]
]
df

Unnamed: 0,clean_text,category,Offensive
0,when modi promised “minimum government maxim...,-1.0,0
1,talk all the nonsense and continue all the dra...,0.0,0
2,what did just say vote for modi welcome bjp t...,1.0,0
3,asking his supporters prefix chowkidar their n...,1.0,0
4,answer who among these the most powerful world...,1.0,0
...,...,...,...
162975,why these 456 crores paid neerav modi not reco...,-1.0,0
162976,dear rss terrorist payal gawar what about modi...,-1.0,0
162977,did you cover her interaction forum where she ...,0.0,0
162978,there big project came into india modi dream p...,0.0,0
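
The offensive-language check works on individual words of a comment. The sketch below shows that idea on two sample comments and why testing the whole word list for membership never matches. The entries in `example_bad_words` are hypothetical placeholders, not the contents of `bad-words.csv`.

```python
def is_offensive(comment, bad_words):
    """Return 1 if any whitespace-separated word of comment is in bad_words, else 0."""
    bad = set(bad_words)
    return int(any(word in bad for word in str(comment).lower().split()))


# Hypothetical placeholder entries standing in for bad-words.csv.
example_bad_words = ["badword", "slur"]

comments = ["what a badword this is", "love you modi and isro"]
flags = [is_offensive(comment, example_bad_words) for comment in comments]
print(flags)

# A list of words is never equal to a single string entry, so this is always False.
whole_list_match = comments[0].split() in example_bad_words
print(whole_list_match)
```

Converting the word list to a `set` keeps each lookup fast, which matters for a dataset with more than 160,000 comments.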


In [None]:
# Combining Positive and Neutral Comments into a Single Class: 
# Creating a Binary Classification Problem (0 = neutral or positive, 1 = negative): 
df["category"] = df["category"].replace({0: 0, 1: 0, -1: 1})
df

Unnamed: 0,clean_text,category,Offensive
0,when modi promised “minimum government maxim...,1.0,0
1,talk all the nonsense and continue all the dra...,0.0,0
2,what did just say vote for modi welcome bjp t...,0.0,0
3,asking his supporters prefix chowkidar their n...,0.0,0
4,answer who among these the most powerful world...,0.0,0
...,...,...,...
162975,why these 456 crores paid neerav modi not reco...,1.0,0
162976,dear rss terrorist payal gawar what about modi...,1.0,0
162977,did you cover her interaction forum where she ...,0.0,0
162978,there big project came into india modi dream p...,0.0,0


In [None]:
# Basic Renaming and Rearranging of Comments Dataset:
df["Text_Comments"] = df["clean_text"]
df["Category"] = df["category"]
df.drop(columns=['clean_text', 'category'],axis=1,inplace=True)
df

Unnamed: 0,Offensive,Text_Comments,Category
0,0,when modi promised “minimum government maxim...,1.0
1,0,talk all the nonsense and continue all the dra...,0.0
2,0,what did just say vote for modi welcome bjp t...,0.0
3,0,asking his supporters prefix chowkidar their n...,0.0
4,0,answer who among these the most powerful world...,0.0
...,...,...,...
162975,0,why these 456 crores paid neerav modi not reco...,1.0
162976,0,dear rss terrorist payal gawar what about modi...,1.0
162977,0,did you cover her interaction forum where she ...,0.0
162978,0,there big project came into india modi dream p...,0.0


**Preparing Datasets for Compilation**

In [None]:
df = df.reindex(columns=['Text_Comments',"Category","Offensive"])

In [None]:
df3 = df3.reindex(columns=["Text_Comments","Category","Offensive"])

In [None]:
df3.columns

Index(['Text_Comments', 'Category', 'Offensive'], dtype='object')

In [None]:
df.columns

Index(['Text_Comments', 'Category', 'Offensive'], dtype='object')

In [None]:
df = pd.concat([df,df3],ignore_index=True)
df

Unnamed: 0,Text_Comments,Category,Offensive
0,when modi promised “minimum government maxim...,1.0,0
1,talk all the nonsense and continue all the dra...,0.0,0
2,what did just say vote for modi welcome bjp t...,0.0,0
3,asking his supporters prefix chowkidar their n...,0.0,0
4,answer who among these the most powerful world...,0.0,0
...,...,...,...
172391,zenana,1.0,0
172392,zephyr,1.0,0
172393,zero,1.0,0
172394,zol,1.0,0


In [None]:
# Adding the Bad-Words List as Rows Labelled Negative (1) and Offensive (1): 
df4 = pd.DataFrame({"Text_Comments": bad_words})
df4["Category"] = 1
df4["Offensive"] = 1
df = pd.concat([df,df4],ignore_index=True)
df.head(10)

Unnamed: 0,index,Text_Comments,Category,Offensive
0,147656.0,joto folar fulee naow aber tomar jabe goodi,0.0,0
1,96390.0,village electrification’ still leaves the pe...,0.0,0
2,83738.0,meerut speech was 5546 minutes expect sharp at...,1.0,0
3,98689.0,sorry intention not hurt anyone but actions ha...,1.0,0
4,65255.0,all because nehru modi that too because nehru,0.0,0
5,68543.0,dont believe modi will space personally collec...,0.0,0
6,58714.0,these all jokers they dont want give credit modi,0.0,0
7,154547.0,then pls teach chowkidar chors like modishah,0.0,0
8,28278.0,finally rahul gandhi agreed there was surgical...,0.0,0
9,48460.0,love you modi and isro and drdo,0.0,0


In [None]:
# Randomizing Row Order within Dataset for Even Data Spread:
# drop=True discards the old index instead of adding it as an extra column.
df = df.sample(frac=1,random_state=42).reset_index(drop=True)

In [None]:
# Removing Any Null Values before Model Building: 
df.isnull().sum()
df.dropna(inplace = True)

In [None]:
# Model 1: Predicting Whether Comment is Negative or Not: 
X1 = df["Text_Comments"]
y1 = df["Category"]

In [None]:
# Creating the Training-Testing Dataset: 
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=42)

**MODEL 1: PREDICT WHETHER THE TEXT IS NEGATIVE OR NOT**

In [None]:
# Create the vectorizer
vectorizer = CountVectorizer()
multi_nb = MultinomialNB()

# fit the vectorizer to the corpus
vectorizer.fit(X_train)

# transform the corpus into a document-term matrix
doc_term_matrix = vectorizer.transform(X_train)

# example labels for the corpus
labels = y_train

# create the pipeline
pipeline1 = Pipeline([
    ('vectorizer', CountVectorizer()),  # convert text to numerical features
    ('model', MultinomialNB())     # train a multinomial Naive Bayes classifier
])

# fit the pipeline to the data
pipeline1.fit(X_train,y_train)

# make predictions on new data
y_pred = pipeline1.predict(X_test)
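
To make the "document-term matrix" step concrete, here is a simplified pure-Python sketch of what a count vectorizer produces. It is an illustration only: it splits on whitespace, whereas scikit-learn's `CountVectorizer` uses its own token pattern and other defaults. The helper names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Map every distinct lower-cased word in the corpus to a column index."""
    words = sorted({word for doc in corpus for word in doc.lower().split()})
    return {word: idx for idx, word in enumerate(words)}


def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc (one matrix row)."""
    counts = Counter(doc.lower().split())
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(word, 0) for word in vocabulary]


corpus = ["vote for modi", "modi modi welcome"]
vocabulary = build_vocabulary(corpus)
matrix = [to_count_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one comment and each column one word. The Naive Bayes model is trained on these counts rather than on the raw strings.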

In [None]:
from sklearn.metrics import accuracy_score

**MODEL- 1: ACCURACY:**

In [None]:
# Accuracy Score for the 1st Model: 
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score: ",accuracy)

Accuracy Score:  0.8359226834734966


**MODEL 2: PREDICTING WHETHER COMMENT IS OFFENSIVE OR NOT**

In [None]:
# Model 2: Predicting Whether Comment is Offensive or Not: 
X1 = df["Text_Comments"]
y1 = df["Offensive"]

In [None]:
# Creating the Training-Testing Dataset: 
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=42)

In [None]:
vectorizer = CountVectorizer()
multi_nb = MultinomialNB()

# fit the vectorizer to the training dataset: 
vectorizer.fit(X1_train)

# transform the corpus into a document-term matrix
doc_term_matrix = vectorizer.transform(X1_train)

# example labels for the corpus
labels = y1_train

# create the pipeline
pipeline2 = Pipeline([
    ('vectorizer', CountVectorizer()),  # convert text to numerical features
    ('model', MultinomialNB())     # train a multinomial Naive Bayes classifier
])

# fit the pipeline to the data
pipeline2.fit(X1_train,y1_train)

# make predictions on new data
y1_pred = pipeline2.predict(X1_test)

**MODEL-2 ACCURACY:**

In [None]:
accuracy = accuracy_score(y1_test, y1_pred)
print("Accuracy Score: ",accuracy)

Accuracy Score:  0.9909963410662631
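
Accuracy alone can look very high when one class is rare, which may be the case for the `Offensive` label. The sketch below computes precision and recall by hand on hypothetical toy labels (not results from this notebook) to show the effect. scikit-learn provides `precision_score`, `recall_score`, and `classification_report` in `sklearn.metrics` for the same purpose.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 98 non-offensive rows, 2 offensive rows,
# and a model that always predicts "not offensive".
y_true = [0] * 98 + [1, 1]
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Here the accuracy is high even though no offensive row is found, so recall for the offensive class is worth reporting alongside accuracy.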


**CODE FOR PREDICTION OF OFFENSIVE AND NEGATIVE COMMENTS :**

In [None]:
# For Negative Comments: 
# predicted_labels = pipeline1.predict(new_data)
# print(predicted_labels)
# For Offensive Comments: 
# predicted_labels = pipeline2.predict(new_data)
# print(predicted_labels)
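
A fitted pipeline expects an iterable of documents, so a single piece of text has to be wrapped in a list. The sketch below shows this with a tiny pipeline trained on a hypothetical four-sentence corpus. The names `toy_pipeline`, `train_texts`, and `new_text` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = not negative, 1 = negative.
train_texts = [
    "love this great day",
    "great work well done",
    "hate this awful day",
    "awful terrible work",
]
train_labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),  # text -> word counts
    ("model", MultinomialNB()),         # multinomial Naive Bayes classifier
])
toy_pipeline.fit(train_texts, train_labels)

new_text = "awful day"
# One document in, one prediction out.
one_prediction = toy_pipeline.predict([new_text])
# list(new_text) is a list of single characters, giving one prediction per character.
per_character = toy_pipeline.predict(list(new_text))
print(len(one_prediction), len(per_character))
```

The same wrapping applies to `pipeline1` and `pipeline2`: pass `[text]` to get a single label for the whole comment.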

**IMPLEMENTATION OF EASY-OCR:**

In [None]:
!pip3 install torch torchvision torchaudio

In [None]:
!pip install easyocr

In [None]:
import os
import cv2 
import easyocr

In [None]:
# Specifying Image Paths : 
image_path = "Meme13.jpeg"

In [None]:
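# Creating EasyOCR Reader to Extract Text from Image: 
reader = easyocr.Reader(['en'])
# paragraph takes a boolean: True merges nearby text boxes into paragraphs. detail=0 returns only the text.
result = reader.readtext(image_path,paragraph=True,detail=0)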
result



['SAW YOU HAD INEGATIVE NUMBER', 'SO LEVENEDYOU OUT']
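
With `detail=1` (EasyOCR's default) and paragraph mode off, each detection is returned as a bounding box, the text, and a confidence score. A possible way to drop low-confidence detections before classification is sketched below. The sample detections and the `0.5` threshold are hypothetical values for illustration, not output from `Meme13.jpeg`.

```python
def keep_confident_text(detections, min_confidence=0.5):
    """Keep the text of detections whose confidence is at least min_confidence."""
    return [text for _bbox, text, confidence in detections if confidence >= min_confidence]


# Hypothetical detections in (bounding box, text, confidence) form.
sample_detections = [
    ([[10, 10], [200, 10], [200, 40], [10, 40]], "SAW YOU HAD", 0.91),
    ([[10, 50], [220, 50], [220, 80], [10, 80]], "INEGATIVE NUMBER", 0.42),
]

confident_text = keep_confident_text(sample_detections)
print(confident_text)
```

Whether filtering helps depends on the images. For short meme captions it may remove words that spelling correction could still have recovered.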

In [None]:
len(result)
result[1].split()

['SO', 'LEVENEDYOU', 'OUT']

In [None]:
!pip install pyspellchecker



**CORRECTING SPELLING ERRORS IN TEXT :**

In [None]:
from spellchecker import SpellChecker

In [None]:
def correct_spelling(text):
    # initialize spellchecker object
    spell = SpellChecker()
    # split each extracted string into individual words
    inc_words = [word for line in text for word in line.split()]
    # correct each word; spell.correction returns None when it finds no candidate
    corrected_words = [spell.correction(word) for word in inc_words]
    return corrected_words
   
corrected_result = correct_spelling(result)
corrected_result

['SAW', 'YOU', 'HAD', 'negative', 'NUMBER', 'SO', None, 'OUT']

In [None]:
string = ' '.join(str(x) for x in corrected_result)
string = string.replace(","," ")
string

'SAW YOU HAD negative NUMBER SO None OUT'
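
When the spell checker has no suggestion it returns `None`, which ends up in the joined string as the literal word "None". One option is to fall back to the original word, as in the sketch below. `toy_correct` is a hypothetical stand-in for `SpellChecker().correction` so that the example runs without pyspellchecker.

```python
def correct_with_fallback(words, correct):
    """Apply correct() to each word, keeping the original word when it returns None."""
    corrected = []
    for word in words:
        suggestion = correct(word)
        corrected.append(suggestion if suggestion is not None else word)
    return corrected


def toy_correct(word):
    """Hypothetical corrector: fixes one known typo and gives up on very long words."""
    known_fixes = {"INEGATIVE": "negative"}
    if word in known_fixes:
        return known_fixes[word]
    return None if len(word) > 8 else word


words = ["HAD", "INEGATIVE", "SO", "LEVENEDYOU", "OUT"]
fallback_result = correct_with_fallback(words, toy_correct)
print(" ".join(fallback_result))
```

Keeping the uncorrected word preserves whatever signal it carries instead of feeding the unrelated token "None" to the classifiers.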


In [None]:
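# Putting Text in model for Prediction: 
# For Negative Comments: 
# predict expects an iterable of documents, so the text is wrapped in a list;
# list(string) would split it into single characters.
predicted_result1 = pipeline1.predict([string])
predicted_result1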

array([1., 0.])

In [None]:
# For Offensive Comments: 
# Wrap the text in a list so the whole comment is classified as one document.
predicted_result2 = pipeline2.predict([string])
predicted_result2

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## **Conclusion:**<br>
Text extraction and classification of social media content from images offers a valuable solution for businesses and social media platforms to better understand their users' preferences and opinions, as well as to detect and prevent harmful content. This can help to create a safer, more inclusive, and engaging social media experience for all users.