# Data Collection and Preparation Notebook 📊

## Introduction 📝

Welcome to the "Data Collection and Preparation" notebook (01_get_data) 🚀. In this notebook, we'll focus on the initial steps of data collection, loading, and preparation for your project. We'll be working with various datasets, including Reddit and Twitter data, to create a consolidated and cleaned dataset that will be used for further analysis.

## Table of Contents 📑

- [Importing Libraries](#importing-libraries)
- [Loading Datasets](#loading-datasets)
    - [Show dataframe](#show-dataframe)
- [Concatenating Datasets](#concatenating-datasets)
- [Data Cleaning](#data-cleaning)
    - [Final Data Summary](#final-data-summary)
- [Saving Processed Data](#saving-processed-data)
- [Conclusion](#conclusion)
- [References](#references)

## Importing Libraries 📚

Let's start by importing the necessary libraries for our data processing tasks.

In [1]:
import pandas as pd
import numpy as np

## Loading Datasets 📊

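We will load the six raw CSV files required for our analysis: the Reddit and Twitter files, the `Train`/`Test`/`Valid` splits, and `dataset.csv`. All of them come from the Kaggle datasets listed under References.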

In [2]:
df = pd.read_csv("./../data/raw_data/Reddit_Data.csv")
df1 = pd.read_csv("./../data/raw_data/Test.csv")
df2 = pd.read_csv("./../data/raw_data/Train.csv")
df3 = pd.read_csv("./../data/raw_data/Twitter_Data.csv")
df4 = pd.read_csv("./../data/raw_data/Valid.csv")
df5 = pd.read_csv("./../data/raw_data/dataset.csv")
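
Since only one text column per file is used later in this notebook, one possible variation is to read just that column with the `usecols` parameter of `pd.read_csv`. The sketch below is an illustration, not the notebook's own loading code: it writes a tiny stand-in CSV to a temporary directory (the file name `sample.csv` and its contents are made up) and reads back only the `text` column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for one raw file; the real files live in ./../data/raw_data/.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.csv"
    sample_path.write_text(
        "text,label\nfirst review,0\nsecond review,1\n", encoding="utf-8"
    )

    # usecols limits parsing to the named column, so the label column is skipped.
    sample = pd.read_csv(sample_path, usecols=["text"])

print(sample.columns.tolist())
print(len(sample))
```

For large files this can reduce memory use, at the cost of having to know each file's column name up front.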

### Show dataframe 💡

Display all loaded DataFrames.

In [3]:
print(df, df1, df2, df3, df4, df5, sep='\n')

                                           clean_comment  category
0       family mormon have never tried explain them t...         1
1      buddhism has very much lot compatible with chr...         1
2      seriously don say thing first all they won get...        -1
3      what you have learned yours and only yours wha...         0
4      for your own benefit you may want read living ...         1
...                                                  ...       ...
37244                                              jesus         0
37245  kya bhai pure saal chutiya banaya modi aur jab...         1
37246              downvote karna tha par upvote hogaya          0
37247                                         haha nice          1
37248             facebook itself now working bjp’ cell          0

[37249 rows x 2 columns]
                                                   text  label
0     I always wrote this series off as being a comp...      0
1     1st watched 12/7/2002 - 3 out of 10(Di
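
Printing six full DataFrames produces a lot of output. As a rough alternative, a one-row-per-frame overview of row counts and column names may be easier to scan. The sketch below uses two toy frames in place of the real ones; the `frames` dictionary and the `overview` table are names introduced here for illustration only.

```python
import pandas as pd

# Toy frames standing in for the loaded DataFrames (contents are illustrative).
frames = {
    "df": pd.DataFrame({"clean_comment": ["haha nice", "jesus"], "category": [1, 0]}),
    "df1": pd.DataFrame({"text": ["a review"], "label": [0]}),
}

# Build one summary record per frame: its row count and its column names.
overview = pd.DataFrame(
    [
        {"name": name, "rows": len(frame), "columns": ", ".join(frame.columns)}
        for name, frame in frames.items()
    ]
).set_index("name")

print(overview)
```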

## Concatenating Datasets 📁

Now, let's concatenate the loaded datasets to create a unified dataset for analysis.

In [4]:
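# Select the text column from each source; the column name differs per file.
dfs = [df.clean_comment, df1.text, df2.text, df3.clean_text, df4.text, df5.Text]

# Stack the Series vertically and renumber the index from 0.
df_result = pd.concat(dfs, axis=0, ignore_index=True)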
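
To see what the concatenation does to the index, here is a minimal sketch on toy data (the Series `comments` and `reviews` are made up). Each input Series starts its index at 0, so stacking them would repeat labels unless the index is renumbered; passing `ignore_index=True` to `pd.concat` renumbers in the same call, which has the same effect as a separate `reset_index(drop=True)`.

```python
import pandas as pd

# Toy Series standing in for the selected text columns (values are illustrative).
comments = pd.Series(["haha nice", "jesus"], name="clean_comment")
reviews = pd.Series(["a review"], name="text")

# Without ignore_index the original labels 0, 1, 0 would be kept;
# ignore_index=True renumbers the result 0..n-1 in one step.
combined = pd.concat([comments, reviews], axis=0, ignore_index=True)

print(combined.index.tolist())
print(combined.tolist())
```

Because the inputs carry different names, it is safer not to rely on the name of the combined Series and to set the column name explicitly when converting it to a DataFrame.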

## Data Cleaning 🧹

We will clean the combined data by treating empty strings as missing values, dropping all missing values, and removing duplicate texts. The result is stored as a one-column DataFrame named `text`.

In [5]:
df_cleaned = df_result.replace('', np.nan).dropna().drop_duplicates().reset_index(drop=True).to_frame(name='text')
df_cleaned

         text
0        family mormon have never tried explain them t...
1        buddhism has very much lot compatible with chr...
2        seriously don say thing first all they won get...
3        what you have learned yours and only yours wha...
4        for your own benefit you may want read living ...
...      ...
1178833  @Juice_Lemons in the dark. it’s so good
1178834  8.SSR &amp; Disha Salian case should be solved...
1178835  *ACCIDENT: Damage Only* - Raleigh Fire Depart...
1178836  @reblavoie So happy for her! She’s been incred...
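
The cleaning chain can be checked on a small example. The sketch below runs the same steps on a toy Series (values made up) and also shows one caveat: `replace('', np.nan)` matches only exactly empty strings, so whitespace-only entries survive. The `stricter` variant, which strips whitespace first, is an assumed alternative and not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy input with an empty string, a missing value, a duplicate,
# and a whitespace-only entry.
raw = pd.Series(["good", "", None, "good", "   ", "bad"])

# Same chain as the notebook cell: '' -> NaN, drop NaN, drop duplicates.
as_in_notebook = (
    raw.replace("", np.nan).dropna().drop_duplicates().reset_index(drop=True)
)

# Assumed variation: strip whitespace first so blank-looking entries go too.
stricter = (
    raw.str.strip()
    .replace("", np.nan)
    .dropna()
    .drop_duplicates()
    .reset_index(drop=True)
)

print(as_in_notebook.tolist())  # ['good', '   ', 'bad']
print(stricter.tolist())        # ['good', 'bad']
```

Note that the stricter variant also changes the stored text, since leading and trailing whitespace is removed from every entry.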


### Final Data Summary 📊

Let's check the summary of the cleaned dataset.

In [6]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1178838 entries, 0 to 1178837
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1178838 non-null  object
dtypes: object(1)
memory usage: 9.0+ MB
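
The `+` in `memory usage: 9.0+ MB` means the figure is a lower bound: for an `object` column, pandas counts only the references to the strings by default (here roughly 1,178,838 × 8 bytes ≈ 9.0 MiB), not the strings themselves. A minimal sketch of how to get the fuller figure with `memory_usage(deep=True)`, using a toy frame with made-up contents:

```python
import pandas as pd

# Toy frame standing in for df_cleaned; dtype=object matches the info() output above.
toy = pd.DataFrame(
    {"text": pd.Series(["haha nice", "a much longer piece of text " * 10], dtype=object)}
)

# Shallow count: for object columns this covers only the references.
shallow_bytes = toy.memory_usage(deep=False).sum()

# Deep count: also includes the Python string objects themselves.
deep_bytes = toy.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)
```

On the real dataset, `df_cleaned.info(memory_usage='deep')` reports the deep figure directly, though it takes longer to compute.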


## Saving Processed Data 💾

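Finally, we'll save the cleaned dataset to a Parquet file with Brotli compression. Despite the `.br` extension, the file is in Parquet format and should be read back with `pd.read_parquet`.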

In [7]:
df_cleaned.to_parquet('./../data/Text_dataset.br', engine='pyarrow', compression='brotli', index=False)

## Conclusion 📝

This notebook covered the initial steps of data collection, loading, and preparation. We concatenated the text columns from six source files, removed empty, missing, and duplicate entries, and saved the resulting 1,178,838 texts as a single dataset. This dataset can now serve as a foundation for more advanced analysis and modeling tasks in your project.

Happy coding! 🎉

## References 📚

Data sources:
- [Twitter and Reddit Sentimental analysis Dataset](https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset)
- [Sentiment Dataset with 1 Million Tweets](https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets)
- [IMDB dataset (Sentiment analysis) in CSV format](https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format)