In [None]:
import sys
import os

# Add the directory that contains clean_cresci_2015.py to sys.path
# (sys.path entries must be directories, not individual .py files)
sys.path.append(os.path.abspath("git"))

# import clean_cresci_2015
from clean_cresci_2015 import clean_cresci_2015
from import_data import ImportData
from evaluation import Evaluate

# Clean Cresci-2015 Dataset

## Overview
This script cleans and preprocesses the Cresci-2015 dataset, which contains Twitter data for social bots and genuine users. The cleaning and preprocessing steps prepare the data for further analysis or machine learning tasks.

## Dependencies
- pandas: Data manipulation library in Python.
- numpy: Numerical computing library in Python.
- os: Operating system interface for file operations.
- datetime: Library for manipulating dates and times in Python.
- sklearn.preprocessing.MinMaxScaler: Class for scaling numerical features to a specified range.
- sklearn.model_selection.train_test_split: Function for splitting data into training and testing sets.

## Class: clean_cresci_2015
This class contains a method `clean_data` which performs the following steps:

1. **Loading Data**: 
   - Loads tweet and user data from CSV files located in subdirectories of `base_directory`.

2. **Feature Selection**:
   - Selects relevant features from both tweet and user datasets.

3. **Data Type Conversion**:
   - Converts the 'created_at' column in the users dataset to datetime format.

4. **Handling Missing Values**:
   - Fills missing values with zeros for numeric columns in both tweet and user datasets.

5. **Feature Engineering**:
   - Calculates user-level features such as 'account_age_years' and 'followers_to_friends_ratio' in the users dataset.
   - Calculates tweet-level features such as 'retweet_ratio' and 'reply_ratio'.

6. **Normalization**:
   - Scales numeric features in both tweet and user datasets to a range between 0 and 1 using Min-Max scaling.

7. **Data Merging**:
   - Merges the tweet and user datasets on the 'user_id' and 'id' columns respectively.

8. **Bot Labeling**:
   - Adds a binary 'bot' label based on the folder name.

9. **Saving Cleaned Data**:
   - Saves the cleaned and processed dataframes as CSV files in a 'clean' subdirectory within each dataset's folder.
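
The steps above can be sketched on toy data as follows. This is a minimal illustration, not the actual `clean_data` implementation: the column names `followers_count`, `friends_count`, and `retweet_count`, the fixed reference date, and the folder names used for labeling are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the user and tweet tables (column names are assumptions).
users = pd.DataFrame({
    "id": [1, 2, 3],
    "created_at": ["2010-01-01", "2012-06-15", "2014-03-30"],
    "followers_count": [100, 5, np.nan],
    "friends_count": [50, 0, 200],
})
tweets = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "retweet_count": [3, np.nan, 0, 7],
})

# Data type conversion, then zero-filling of missing numeric values.
users["created_at"] = pd.to_datetime(users["created_at"])
user_numeric = ["followers_count", "friends_count"]
users[user_numeric] = users[user_numeric].fillna(0)
tweets["retweet_count"] = tweets["retweet_count"].fillna(0)

# Feature engineering. The reference date is an assumption chosen so the
# result is reproducible; the real script may use a different date.
reference_date = pd.Timestamp("2015-01-01")
users["account_age_years"] = (reference_date - users["created_at"]).dt.days / 365.25
# Guard against division by zero: users with 0 friends get a ratio of 0.
users["followers_to_friends_ratio"] = (
    users["followers_count"] / users["friends_count"].replace(0, np.nan)
).fillna(0)

# Min-max scaling of the numeric user features to [0, 1].
scaled_cols = user_numeric + ["account_age_years", "followers_to_friends_ratio"]
users[scaled_cols] = MinMaxScaler().fit_transform(users[scaled_cols])

# Merge tweets with users on 'user_id' (tweets) and 'id' (users).
merged = tweets.merge(users, left_on="user_id", right_on="id", how="left")

# Bot labeling from the folder name (both names below are assumptions).
bot_folders = {"FSF", "INT", "TWT"}
folder_name = "E13"
merged["bot"] = int(folder_name in bot_folders)
```

Filling the ratio with 0 for users who follow nobody is one possible convention; the original script may handle that case differently.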

## Example Usage:



In [None]:
cleaner = clean_cresci_2015()
cleaner.clean_data(base_directory="git/cresci-2015.csv/")

# Evaluation Metrics and Visualization

## Overview
This script defines a class `Evaluate` that computes various evaluation metrics for binary classification models and provides methods to visualize the evaluation results.

## Dependencies
- pandas: Data manipulation library in Python.
- numpy: Numerical computing library in Python.
- matplotlib.pyplot: Plotting library in Python.
- seaborn: Statistical data visualization library based on matplotlib.
- sklearn.metrics: Collection of metrics to evaluate the performance of machine learning models.

## Class: Evaluate
This class contains methods to compute the following evaluation metrics:

1. **Accuracy**: Computes the accuracy score of the predictions.
2. **Confusion Matrix**: Computes the confusion matrix rates (True Negative Rate, False Positive Rate, False Negative Rate, True Positive Rate).
3. **Precision**: Computes the precision score.
4. **Recall**: Computes the recall score.
5. **F1 Score**: Computes the F1 score, which is the harmonic mean of precision and recall.
6. **Matthews Correlation Coefficient (MCC)**: Computes the MCC, which measures the correlation between predicted and true binary classifications.
7. **Area Under the ROC Curve (AUC)**: Computes the AUC score, which measures the area under the Receiver Operating Characteristic (ROC) curve.
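
For reference, each of these metrics can be computed directly with `sklearn.metrics`. The sketch below is not the `Evaluate` implementation; in particular, it assumes the confusion matrix rates are normalized per true class, which the class may or may not do.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.9, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # Rates normalized per true class (an assumption about Evaluate).
    "tnr": tn / (tn + fp),
    "fpr": fp / (tn + fp),
    "fnr": fn / (fn + tp),
    "tpr": tp / (fn + tp),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
    # AUC needs scores or probabilities, not hard class predictions.
    "auc": roc_auc_score(y_true, y_prob),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```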

Additionally, the class provides methods to:
- Get all evaluation metrics at once.
- Plot the confusion matrix.
- Plot the ROC curve.

## Usage
1. Instantiate the `Evaluate` class with true values and predicted values.
2. Optionally, provide predicted probabilities for computing AUC.
3. Call individual metric methods or use `get_all_metrics` to obtain all metrics at once.
4. Use `plot_confusion_matrix` to visualize the confusion matrix.
5. Use `plot_roc_curve` to visualize the ROC curve.

## Example Usage:


In [None]:
true_values = [0, 1, 1, 0, 1, 0, 0, 1]
predicted_values = [0, 1, 1, 0, 1, 0, 1, 1]
predicted_probabilities = [0.1, 0.9, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4]

evaluator = Evaluate(true_values, predicted_values, predicted_probabilities)
print(evaluator.get_all_metrics())

evaluator.plot_confusion_matrix()
evaluator.plot_roc_curve()

# Data Import and Splitting

## Overview
This script provides functionality for importing data from the Cresci-2015 dataset and splitting it into training, testing, and validation sets. It also allows for sampling the data based on specified bot ratios.

## Dependencies
- os: Operating system interface for file operations.
- pandas: Data manipulation library in Python.
- sklearn.model_selection.train_test_split: Function for splitting data into training and testing sets.

## Class: ImportData
This class contains methods to perform the following tasks:

1. **Data Typing**:
   - Determines the type of data to import based on the provided parameter.

2. **Reading and Sampling Data**:
   - Reads the non-bot and bot dataframes from CSV files.
   - Samples the bot data based on specified bot ratios and combines it with non-bot data.
   
3. **Splitting Dataset**:
   - Splits the dataset into training, testing, and validation sets based on the provided proportions.
   - Stratifies the split based on the target feature to maintain the distribution of classes in each split.

## Methods:

### `type_data(self, type_data_merged)`
Determines the type of data to use based on the provided parameter (`type_data_merged`).

### `read_and_sample_data(self, base_path="../Data/cresci-2015.csv/", type_data_merged=1, bot_ratio=[.2, .8], bot_fldr_ratio=[1, 1, 1])`
Reads and samples data from CSV files based on the provided parameters.
- `base_path`: The base directory containing the dataset files.
- `type_data_merged`: Type of data to use (merged or user-specific).
- `bot_ratio`: Desired ratio of bot samples in the final dataset.
- `bot_fldr_ratio`: Ratios of bot samples from different bot folders.
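
The exact semantics of `bot_ratio` and `bot_fldr_ratio` are not spelled out here, so the following is only a sketch of the general idea of sampling bots to reach a target class mix. The helper `sample_to_bot_fraction` and its single `bot_fraction` parameter are hypothetical and are not part of `ImportData`.

```python
import pandas as pd


def sample_to_bot_fraction(non_bots, bots, bot_fraction, seed=0):
    """Hypothetical helper: keep all non-bots and sample bots so that
    bots make up roughly `bot_fraction` of the combined data."""
    # Solve n_bots / (n_bots + n_non_bots) = bot_fraction for n_bots.
    n_bots = round(bot_fraction * len(non_bots) / (1 - bot_fraction))
    n_bots = min(n_bots, len(bots))  # cannot draw more bots than exist
    sampled_bots = bots.sample(n=n_bots, random_state=seed)
    combined = pd.concat([non_bots, sampled_bots], ignore_index=True)
    # Shuffle so the classes are not grouped together.
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 80 genuine accounts and 100 bot accounts.
non_bots = pd.DataFrame({"id": range(80), "bot": 0})
bots = pd.DataFrame({"id": range(100, 200), "bot": 1})

sampled = sample_to_bot_fraction(non_bots, bots, bot_fraction=0.2)
print(sampled["bot"].value_counts())
```

A per-folder weighting such as `bot_fldr_ratio` could be layered on top by sampling each bot folder separately before concatenating, but how `ImportData` actually does this is not shown in this notebook.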

### `split_dataset(self, data, proportions=[.7, .15, .15], target='bot')`
Splits the dataset into training, testing, and validation sets.
- `data`: The DataFrame to split.
- `proportions`: Proportions of training, testing, and validation sets.
- `target`: The name of the target feature.
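
A stratified three-way split is commonly built from two calls to `train_test_split`. The sketch below illustrates that pattern under stated assumptions: the function name `three_way_split`, the `seed` parameter, and the dictionary keys are hypothetical and may differ from what `split_dataset` returns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(data, proportions=(0.7, 0.15, 0.15), target="bot", seed=0):
    """Hypothetical helper: stratified train/test/validation split."""
    train_p, test_p, val_p = proportions
    # First split: carve off the training set, stratified on the target.
    train, rest = train_test_split(
        data, train_size=train_p, stratify=data[target], random_state=seed
    )
    # Second split: divide the remainder between test and validation.
    test, val = train_test_split(
        rest,
        train_size=test_p / (test_p + val_p),
        stratify=rest[target],
        random_state=seed,
    )
    return {"train": train, "test": test, "val": val}


# Toy data with a balanced binary target.
data = pd.DataFrame({"x": range(200), "bot": [0] * 100 + [1] * 100})
parts = three_way_split(data)
for name, part in parts.items():
    print(name, len(part), part["bot"].mean())
```

Stratifying both calls keeps the bot share roughly equal across the three sets, which matters when the classes are imbalanced.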

## Example Usage:

In [None]:
importer = ImportData()

# Read and sample data
data = importer.read_and_sample_data()
print("Sampled Data:")
print(data.head())

# Split dataset
split_data = importer.split_dataset(data)
print("\nSplit Data:")
for key, value in split_data.items():
    print(f"{key}: {len(value)}")

