<a href="https://colab.research.google.com/github/RachelNderitu/RachelNderitu/blob/main/Hatespeech_Kenya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The code below reads the CSV file into a pandas DataFrame.

In [13]:
import pandas as pd
path='/content/HateSpeech_Kenya.csv'
df=pd.read_csv(path,sep=",",encoding='utf-8')

In [14]:
# See the first few rows of the data
df.head()

   hate_speech  offensive_language  neither  Class                                              Tweet
0            0                   0        3      0  ['The political elite are in desperation. Ordi...
1            0                   0        3      0  ["Am just curious the only people who are call...
2            0                   0        3      0  ['USERNAME_3 the area politicians are the one ...
3            0                   0        3      0  ['War expected in Nakuru if something is not d...
4            0                   0        3      0  ['USERNAME_4 tells kikuyus activists that they...


In [15]:
# View the column names
df.columns


Index(['hate_speech', 'offensive_language', 'neither', 'Class', 'Tweet'], dtype='object')

The code below cleans the data by removing duplicate rows and rows with missing values.


In [16]:
# Data cleaning
# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Drop rows that contain missing values
df.dropna(inplace=True)
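
As a small illustration of what these two calls do, the sketch below applies them to a hypothetical toy frame. The data and the name `toy` are invented for this example and are not taken from HateSpeech_Kenya.csv; the `reset_index` step is an optional extra that the notebook does not use.

```python
import pandas as pd

# Hypothetical toy data: row 1 duplicates row 0, and row 2 has a missing tweet
toy = pd.DataFrame({
    "Class": [0, 0, 1, 2],
    "Tweet": ["hello", "hello", None, "bye"],
})

rows_before = len(toy)
toy = toy.drop_duplicates()        # drops the repeated (0, "hello") row
toy = toy.dropna()                 # drops the row whose Tweet is missing
toy = toy.reset_index(drop=True)   # optional: renumber the remaining rows from 0

print(f"{rows_before} rows before cleaning, {len(toy)} rows after")
```

Comparing the row count before and after is a quick way to see how much data the cleaning step discards.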

The code below does the following:

1. **Removal of Punctuation and Capitalization**
   - **Removal of Punctuation** - Replaces every non-letter character (periods, commas, exclamation marks, digits, and other symbols) with a space.
   - **Capitalization** - Converts all text to lowercase so that the model does not treat the same word in different cases as distinct.

2. **Tokenizing**
   - Splits the text on whitespace into smaller units called tokens, here individual words, which makes the text easier to analyze and process.

3. **Removal of Stopwords**
   - **Stopwords** - Common words (like "and," "the," "is") that are removed because they carry little meaningful information for analysis.

4. **Stemming**
   - Reduces words to their root form (e.g., "running" to "run"), so that different forms of a word map to a common base. The stem is not always a dictionary word; "political", for example, becomes "polit".
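
Steps 1 to 3 can be sketched with the standard library alone. This is a simplified illustration, not the notebook's function: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical mini stopword set for illustration only
TOY_STOPWORDS = {"the", "are", "in", "is", "and"}

def toy_preprocess(text):
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # step 1: non-letters become spaces
    lowered = letters_only.lower()                    # step 1: lowercase everything
    tokens = lowered.split()                          # step 2: tokenize on whitespace
    return [t for t in tokens if t not in TOY_STOPWORDS]  # step 3: drop stopwords

print(toy_preprocess("The political elite are in desperation."))
# ['political', 'elite', 'desperation']
```

Stemming would then shorten these tokens further; in the notebook's output the same words appear as "polit elit desper".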

In [17]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')

## 1. Removal of punctuation and capitalization
## 2. Tokenizing
## 3. Removal of stopwords
## 4. Stemming

# Define stopwords and stemmer
stopwords_list = stopwords.words("english")
other_exclusions = ["#ff", "ff", "rt"]
stopwords_list.extend(other_exclusions)
stemmer = PorterStemmer()

def preprocess(text):
    # Replace every non-letter character (punctuation, digits, symbols) with a space
    tweet = re.sub(r'[^a-zA-Z]', ' ', text)
    # Collapse repeated whitespace and trim both ends
    tweet = re.sub(r'\s+', ' ', tweet)
    tweet = tweet.strip()
    # Note: the earlier digit-to-'numbr' substitution was removed here. It could
    # never match, because the first substitution already strips all digits.
    tweet = tweet.lower()

    # Tokenizing
    tokens = tweet.split()

    # Removal of stopwords and stemming
    tokens = [stemmer.stem(token) for token in tokens if token not in stopwords_list]

    return ' '.join(tokens)

df['processed_tweets'] = df['Tweet'].apply(preprocess)

print(df[["Tweet","processed_tweets"]].head(10))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                               Tweet  \
0  ['The political elite are in desperation. Ordi...   
1  ["Am just curious the only people who are call...   
2  ['USERNAME_3 the area politicians are the one ...   
3  ['War expected in Nakuru if something is not d...   
4  ['USERNAME_4 tells kikuyus activists that they...   
5  ['USERNAME_6 USERNAME_7 Nowdays when you go to...   
6  ['We the kalenjins are planning to part ways w...   
7  ['r u sure kikuyus are the ones who want the w...   
8  ['According to Wandimi a staunch USERNAME_8 su...   
9  ["it's not tribalism...but kisiis kalenjins an...   

                                    processed_tweets  
0  polit elit desper ordinari kalenjin suspici ki...  
1  curiou peopl call old mad kikuyu kalenjin good...  
2  usernam area politician one blame coz r insit ...  
3  war expect nakuru someth done luo given seven ...  
4  usernam tell kikuyu activist target target use...  
5  usernam usernam nowday go seek justic polic st... 

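The code below converts the processed text into numerical features using TF-IDF (term frequency-inverse document frequency). The result, tfidf, is a sparse matrix with one row per tweet and one column per unigram or bigram kept by the vectorizer, ready to be used as model input.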
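
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on a hypothetical three-document corpus. It assumes the scheme scikit-learn's `TfidfVectorizer` applies by default (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and covers unigrams only; the names `docs`, `idf`, and `tfidf_row` are invented for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-processed tweets
docs = ["kikuyu kalenjin polit", "polit elit", "war nakuru polit"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

vocab = sorted({t for doc in tokenized for t in doc})
# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for term, weight in zip(vocab, matrix[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second document, "elit" gets a higher weight than "polit" because "polit" occurs in every document and therefore has the lowest possible IDF.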

In [18]:
# To convert the text data into numerical features using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.75, min_df=5, max_features=10000)

# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(df['processed_tweets'])

The code below separates the dataset into features (X) and target (y).
It then splits the data into training (80%) and test (20%) sets for the models that follow.



In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Split the dataset into features (X) and target (y)
X = tfidf
y = df['Class'].astype(int)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
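
Note that the vectorizer above is fitted on all tweets before the split, so the test tweets also influence the vocabulary and IDF weights. A common alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that both are fitted on the training portion only. The `stratify` argument is likewise an assumption, not something the notebook uses; it keeps class proportions the same in both sets. All `toy_*` names are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: five texts per class
toy_texts = ["good day", "nice people", "kind words", "great friend", "lovely town",
             "bad insult", "rude insult", "nasty words", "awful insult", "mean slur"]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split the raw text first, keeping the class balance in both parts
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

# The pipeline fits the vectorizer on training text only, then the classifier
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_model.fit(toy_train, toy_y_train)
toy_predictions = toy_model.predict(toy_test)

print(len(toy_train), "training texts,", len(toy_test), "test texts")
```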


The code below creates a logistic regression model, trains it, predicts on the test set, and reports precision, recall, F1-score, and support for each class, along with the overall accuracy.

In [20]:
# Logistic Regression
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_train, y_train)
y_pred_lr = logistic_regression_model.predict(X_test)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))
acc_lr = accuracy_score(y_test, y_pred_lr)
print("Logistic Regression Accuracy Score:", acc_lr)

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.97      0.86      7166
           1       0.44      0.12      0.19      1806
           2       0.47      0.11      0.18       644

    accuracy                           0.75      9616
   macro avg       0.56      0.40      0.41      9616
weighted avg       0.69      0.75      0.69      9616

Logistic Regression Accuracy Score: 0.7516638935108153
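
The F1 scores in the report follow directly from precision and recall, and the support column gives a useful reference point for accuracy. The short calculation below reproduces both from the numbers printed above; the helper `f1` and the variable names are introduced only for this illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the logistic regression report
report = {0: (0.77, 0.97), 1: (0.44, 0.12), 2: (0.47, 0.11)}
f1_scores = {cls: f1(p, r) for cls, (p, r) in report.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Accuracy of always predicting class 0, from the support column (7166 of 9616)
majority_baseline = 7166 / 9616

for cls, score in f1_scores.items():
    print(f"class {cls}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
print(f"majority-class baseline accuracy = {majority_baseline:.4f}")
```

This prints F1 values of 0.86, 0.19, and 0.18, a macro F1 of 0.41, and a baseline of 0.7452. Because the baseline is so close to the model's accuracy of about 0.75, the per-class recall and macro F1 say more about performance on classes 1 and 2 than accuracy does.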


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The code below creates a support vector machine classifier with a linear kernel, trains it, predicts on the test set, and reports precision, recall, F1-score, and support for each class, along with the overall accuracy.

In [21]:
from sklearn.svm import SVC

# Using the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
y_pred_svm = svm_classifier.predict(X_test)
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))
acc_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy Score:", acc_svm)

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.99      0.86      7166
           1       0.49      0.06      0.11      1806
           2       0.49      0.07      0.13       644

    accuracy                           0.75      9616
   macro avg       0.58      0.37      0.37      9616
weighted avg       0.69      0.75      0.67      9616

SVM Accuracy Score: 0.7510399334442596


I shall use the logistic regression model to make predictions: its accuracy is only marginally higher (0.7517 versus 0.7510), but its macro-averaged F1-score is also higher (0.41 versus 0.37).
The code below demonstrates how to predict the class of a new tweet using the trained logistic regression model in the following steps:
1. Preprocess the new tweet
2. Transform the processed tweet into TF-IDF features with the fitted vectorizer
3. Make a prediction
4. Map the predicted class to a label

In [22]:
# Map each predicted class to its corresponding label
class_labels = {0: "not hate speech", 1: "offensive", 2: "hate speech"}

def predict_tweet(text):
    # Preprocess, vectorize, and classify a single tweet, then return its label
    processed_tweet = preprocess(text)
    features = tfidf_vectorizer.transform([processed_tweet])
    prediction = logistic_regression_model.predict(features)
    return class_labels[int(prediction[0])]

# Prediction examples 1 to 3
new_tweets = ["She likes me.", "She is stupid.", "Kamba are so stupid."]
for new_tweet in new_tweets:
    print("Predicted class for the new tweet:", predict_tweet(new_tweet))


Predicted class for the new tweet: not hate speech
Predicted class for the new tweet: offensive
Predicted class for the new tweet: hate speech
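
A predicted label does not show how confident the model is. As a hedged sketch on hypothetical toy data, `predict_proba` on a fitted `LogisticRegression` returns one probability per class, which can be inspected alongside the label. The two-class corpus and the names `toy_texts`, `vec`, and `clf` are invented for this example and do not reproduce the notebook's model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: class 0 = not offensive, class 1 = offensive
toy_texts = ["nice day", "kind friend", "good people",
             "stupid idiot", "rude fool", "stupid fool"]
toy_labels = [0, 0, 0, 1, 1, 1]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(toy_texts), toy_labels)

# One probability per class, in the order given by clf.classes_
proba = clf.predict_proba(vec.transform(["she is stupid"]))[0]
for cls, p in zip(clf.classes_, proba):
    print(f"class {cls}: probability {p:.2f}")
```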
