The project "Hate Speech Detection in Social Media using Python" aims to detect hate speech on Twitter using Natural Language Processing (NLP) techniques and machine learning. It is inspired by the work of t-davidson and builds on that work by proposing new findings and analyzing how the results change when new features are introduced.

The motivation behind this project is the misuse of social media platforms such as Twitter, which offer freedom of speech on the Internet. The work involved a comprehensive review of existing research in this field, identifying the gaps in that research, and finding ways to address them. The project uses a publicly available dataset provided by CrowdFlower and applies NLP techniques to achieve its goal.

The project starts with an analysis of the dataset, followed by text pre-processing to produce a cleaner dataset for the next step, feature engineering. Unique and important features are extracted, and different sets of features are combined to compare and analyze the performance of various machine learning classification algorithms. Finally, the results are analyzed in depth and the reasons for misclassifications in the model are explained.
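A minimal pre-processing sketch is shown below. It assumes NLTK for stopwords and stemming; the function name `clean_tweet` and the exact regular expressions are illustrative and not taken from the repository.

```python
# Minimal pre-processing sketch (assumed NLTK; names are illustrative).
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()


def clean_tweet(text: str) -> str:
    """Lower-case a tweet, strip URLs/mentions/punctuation, drop stopwords, stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove mention names
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and digits
    tokens = [STEMMER.stem(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


print(clean_tweet("@user check this out!! https://t.co/xyz totally awful..."))
```

Cleaning before feature extraction keeps the vocabulary small and prevents the tf-idf weights from being dominated by URLs and user handles.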

Logistic regression works consistently well with all feature sets except F7, where precision, recall, and f1-score for the "hate" label are zero. The Random Forest classifier performs well with F1 and shows strong performance on all other feature sets, but its performance drops sharply when tf-idf scores are excluded from the feature set. The overall performance of the Naïve Bayes classifier is the weakest for classifying tweets into the hate, offensive, or neither labels, although it performs noticeably better with feature set F7 than with the other feature sets. The SVM classifier is also consistent across all feature sets except F4 and F7. The most important feature is F1, the tf-idf scores, which lead to better classification of hate speech. The sentiment scores also prove to be an important feature for distinguishing hate speech from offensive language, while the doc2vec columns contribute little to classification. Comparing all the graphs, Random Forest is clearly the winner.
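The comparison described above can be reproduced roughly as follows with scikit-learn. Here `X` stands for any one of the feature sets F1–F7; which columns each set contains is defined by the project, so this is a sketch rather than the repository's exact evaluation code.

```python
# Sketch of the classifier comparison (assumed scikit-learn; not the repo's exact code).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_classifiers(X, y):
    """Fit each model on one feature set and print per-label precision/recall/f1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "Naive Bayes": MultinomialNB(),  # expects non-negative features (e.g. tf-idf)
        "SVM": LinearSVC(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

The `zero_division=0` argument mirrors the behaviour seen with F7, where precision and recall for the "hate" label fall to zero.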

Forming the hate speech dataset requires collecting data, which is a challenging task because what is hate speech to one person may be ordinary text to another. Text pre-processing techniques are used to remove unwanted content from the dataset: punctuation removal, tokenization, stopword removal, stemming, and removal of URLs and mention names. The processed text is passed on for feature extraction, where features such as n-gram tf-idf weights, sentiment polarity scores, doc2vec vector columns, and other readability scores are extracted and concatenated into different sets to fit into different classification models. These classification models are evaluated on accuracy and f1-scores across the different feature sets.
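A hedged sketch of this feature extraction step is given below, assuming scikit-learn for tf-idf, NLTK's VADER analyzer for sentiment polarity, and gensim (4.x) for doc2vec; readability scores are omitted, and the helper name `build_features` is illustrative.

```python
# Feature-extraction sketch (assumed scikit-learn, NLTK VADER, gensim 4.x).
import nltk
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("vader_lexicon", quiet=True)


def build_features(tweets):
    """Return one combined feature set: tf-idf + sentiment scores + doc2vec vectors."""
    # n-gram tf-idf weights (unigrams and bigrams)
    tfidf_matrix = TfidfVectorizer(ngram_range=(1, 2), max_features=5000).fit_transform(tweets)

    # sentiment polarity scores (neg/neu/pos/compound) per tweet
    sia = SentimentIntensityAnalyzer()
    sentiment = np.array([list(sia.polarity_scores(t).values()) for t in tweets])

    # doc2vec vector columns
    tagged = [TaggedDocument(t.split(), [i]) for i, t in enumerate(tweets)]
    d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
    doc_vectors = np.array([d2v.dv[i] for i in range(len(tweets))])

    # concatenate everything into a single matrix for the classifiers
    return hstack([tfidf_matrix, sentiment, doc_vectors])
```

Dropping or adding blocks of columns here is how the different feature sets (for example, tf-idf only versus tf-idf plus sentiment) would be formed for comparison.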

The results show that differentiating hate speech from offensive language is a challenging task, and they indicate the benefits of the proposed features. The project provides a valuable resource for addressing the problem of toxic language on Twitter, although a more detailed analysis of features and errors could lead to more robust feature extraction methods and help solve the remaining challenges in this field.
