In [None]:
import sys
import os

# Add the directory containing clean_cresci_2015.py to sys.path
# (sys.path entries must be directories, not paths to .py files)
sys.path.append(os.path.abspath("git"))

# import clean_cresci_2015
from clean_cresci_2015 import clean_cresci_2015
from import_data import ImportData
from evaluation import Evaluate

# Clean Cresci-2015 Dataset

## Overview
This script cleans and preprocesses the Cresci-2015 dataset, which consists of Twitter data from social bots and genuine users, to prepare it for further analysis or machine learning tasks.

## Dependencies
- pandas: Data manipulation library in Python.
- numpy: Numerical computing library in Python.
- os: Operating system interface for file operations.
- datetime: Library for manipulating dates and times in Python.
- sklearn.preprocessing.MinMaxScaler: Class for scaling numerical features to a specified range.
- sklearn.model_selection.train_test_split: Function for splitting data into training and testing sets.

### Methods
- `clean_data`
Cleans the dataset by loading data from CSV files, selecting important features, converting data types, and filling missing values.

### Steps Performed:

1. **Loading Data**: 
   - The script loads tweet and user data from CSV files located in subdirectories of the base directory specified as `base_directory`.

2. **Feature Selection**:
   - Selects relevant features from both tweet and user datasets.

3. **Data Type Conversion**:
   - Converts the 'created_at' column in the users dataset to datetime format.

4. **Handling Missing Values**:
   - Fills missing values with zeros for numeric columns in both tweet and user datasets.

5. **Feature Engineering**:
   - Calculates additional features such as 'account_age_years' and 'followers_to_friends_ratio' in the users dataset.
   - Calculates tweet-level features such as 'retweet_ratio' and 'reply_ratio'.

6. **Normalization**:
   - Scales numeric features in both tweet and user datasets to a range between 0 and 1 using Min-Max scaling.

7. **Data Merging**:
   - Merges the tweet and user datasets on the 'user_id' and 'id' columns respectively.

8. **Bot Labeling**:
   - Adds a binary 'bot' label based on the folder name.

9. **Saving Cleaned Data**:
   - Saves the cleaned and processed dataframes as CSV files in a 'clean' subdirectory within each dataset's folder.
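
Steps 3 to 8 can be sketched on toy data as follows. This is a minimal illustration, not the script's actual implementation: the column names `followers_count`, `friends_count`, and `retweet_count`, the reference date, the `+ 1` in the ratio, and the hand-written `min_max` helper (used here in place of `MinMaxScaler`) are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the user and tweet tables (all values are illustrative).
users = pd.DataFrame({
    "id": [1, 2],
    "created_at": ["2010-01-01", "2014-01-01"],
    "followers_count": [100.0, None],
    "friends_count": [50.0, 200.0],
})
tweets = pd.DataFrame({
    "user_id": [1, 1, 2],
    "retweet_count": [0.0, 4.0, None],
})

# Step 3: convert the creation timestamp to datetime.
users["created_at"] = pd.to_datetime(users["created_at"])

# Step 4: fill missing numeric values with zero.
user_numeric = ["followers_count", "friends_count"]
users[user_numeric] = users[user_numeric].fillna(0)
tweets["retweet_count"] = tweets["retweet_count"].fillna(0)

# Step 5: derive features. The reference date and the +1 that guards
# against division by zero are assumptions of this sketch.
reference_date = pd.Timestamp("2015-01-01")
users["account_age_years"] = (reference_date - users["created_at"]).dt.days / 365.25
users["followers_to_friends_ratio"] = users["followers_count"] / (users["friends_count"] + 1)

# Step 6: min-max scale a numeric column to the range [0, 1].
def min_max(column: pd.Series) -> pd.Series:
    span = column.max() - column.min()
    return (column - column.min()) / span if span else column * 0.0

tweets["retweet_count"] = min_max(tweets["retweet_count"])

# Step 7: merge tweets with users on user_id == id.
merged = tweets.merge(users, left_on="user_id", right_on="id", how="left")

# Step 8: add the label. In the real script it is derived from the folder name.
merged["bot"] = 0
```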

### Returns:
- None. The method adjusts the DataFrame in place.

### Error Handling
- **FileNotFoundError**: This error may occur if the base directory does not exist or is incorrect. Ensure the directory path is correct and accessible.

- **DataError**: If there are issues in data consistency such as missing columns needed for selected features, the method will raise this error.

## Example Usage:



In [None]:
# Initialize the cleaner
cleaner = clean_cresci_2015()

# Specify the base directory if different from the default
base_directory = "git/cresci-2015.csv/"

# Clean the data
cleaner.clean_data(base_directory=base_directory)

# Evaluation Metrics and Visualization

## Overview
This script defines the `Evaluate` class, which provides tools for assessing the performance of machine learning models on Twitter data. It includes methods for calculating performance metrics, visualizing results, and interpreting the effectiveness of model predictions.

## Dependencies
- pandas: Data manipulation library in Python.
- numpy: Numerical computing library in Python.
- matplotlib.pyplot: Plotting library in Python.
- seaborn: Statistical data visualization library based on matplotlib.
- sklearn.metrics: Collection of metrics to evaluate the performance of machine learning models.

## Class: Evaluate
The `Evaluate` class encapsulates methods to compute detailed evaluation metrics and visualize the performance of binary classifiers:

**Metrics Provided**
1. **Accuracy**: The fraction of all predictions that are correct.
2. **Confusion Matrix**: Computes the confusion matrix rates (True Negative Rate, False Positive Rate, False Negative Rate, True Positive Rate).
3. **Precision**: The fraction of predicted positives that are actually positive.
4. **Recall**: The fraction of actual positives that the model identifies.
5. **F1 Score**: Computes the F1 score, which is the harmonic mean of precision and recall.
6. **Matthews Correlation Coefficient (MCC)**: Computes the MCC, which measures the correlation between predicted and true binary classifications.
7. **Area Under the ROC Curve (AUC)**: Computes the AUC score, which measures the area under the Receiver Operating Characteristic (ROC) curve.
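
To make the definitions above concrete, here is a minimal sketch that derives the label-based metrics from the four confusion-matrix counts using only the standard library. The helper name `binary_metrics` is hypothetical and is not part of the `Evaluate` class, which may compute these values differently (for example via `sklearn.metrics`).

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute common binary-classification metrics from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

metrics = binary_metrics([0, 1, 1, 0, 1, 0, 0, 1],
                         [0, 1, 1, 0, 1, 0, 1, 1])
print(metrics)
```

With these labels there is one false positive and no false negative, so recall is 1.0 while precision is 0.8.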

**Visualization Methods**
- **Plot Confusion Matrix**: Visualizes the confusion matrix in a color-coded format to aid in quick interpretation.
- **Plot ROC Curve**: Illustrates the diagnostic ability of the binary classifier system as its discrimination threshold varies.
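
The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python. The function `pairwise_auc` is a hypothetical helper for illustration and is not how `Evaluate` is known to compute AUC.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

auc = pairwise_auc(
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0.1, 0.9, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4],
)
print(auc)  # 15 of the 16 pairs are ordered correctly: 0.9375
```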

Additionally, the class provides `get_all_metrics` to compute all evaluation metrics at once.

## Usage
1. Instantiate the `Evaluate` class with true values and predicted values.
2. Optionally, provide predicted probabilities for computing AUC.
3. Call individual metric methods or use `get_all_metrics` to obtain all metrics at once.
4. Use `plot_confusion_matrix` to visualize the confusion matrix.
5. Use `plot_roc_curve` to visualize the ROC curve.

## Example Usage:


In [None]:
true_values = [0, 1, 1, 0, 1, 0, 0, 1]
predicted_values = [0, 1, 1, 0, 1, 0, 1, 1]
predicted_probabilities = [0.1, 0.9, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4]

evaluator = Evaluate(true_values, predicted_values, predicted_probabilities)
print(evaluator.get_all_metrics())

evaluator.plot_confusion_matrix()
evaluator.plot_roc_curve()

# Data Import and Splitting

## Overview
This script provides functionality for importing data from the Cresci-2015 dataset and splitting it into training, testing, and validation sets. It also allows for sampling the data based on specified bot ratios.

## Dependencies
- `os`: Operating system interface for file operations.
- `pandas`: Data manipulation library in Python.
- `sklearn.model_selection.train_test_split`: Function for splitting data into training and testing sets.

## Class: ImportData
This class contains methods to perform the following tasks:

### Data Typing
- Determines the type of data to import based on the provided parameter.

### Reading and Sampling Data
- Reads the non-bot and bot dataframes from CSV files.
- Samples the bot data based on specified bot ratios and combines it with non-bot data.

### Splitting Dataset
- Splits the dataset into training, testing, and validation sets based on the provided proportions.
- Stratifies the split based on the target feature to maintain the distribution of classes in each split.

## Methods

`load_data(file_path)`\
Loads data from a specified CSV file into a pandas DataFrame.

- **Parameters:**
  - `file_path` (str): Path to the CSV file.
- **Returns:**
  - `DataFrame`: A pandas DataFrame containing the loaded data.

`type_data(self, type_data_merged)`\
Determines the type of data to use based on the provided parameter (`type_data_merged`).

- **Parameters:**
  - `type_data_merged` (int): Parameter to determine the type of data to use.

`read_and_sample_data(self, base_path="../Data/cresci-2015.csv/", type_data_merged=1, bot_ratio=[0.2, 0.8], bot_fldr_ratio=[1, 1, 1])`\
Reads and samples data from CSV files based on the provided parameters.

- **Parameters:**
  - `base_path` (str): The base directory containing the dataset files.
  - `type_data_merged` (int): Type of data to use (merged or user-specific).
  - `bot_ratio` (list): Desired ratio of bot samples in the final dataset.
  - `bot_fldr_ratio` (list): Ratios of bot samples from different bot folders.
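
The sampling idea can be sketched as follows. The exact meaning of the two-element `bot_ratio` list is not spelled out here, so this example assumes a single target bot fraction and keeps every non-bot row. The helper `sample_to_bot_fraction` and the toy data are hypothetical.

```python
import pandas as pd

def sample_to_bot_fraction(non_bots, bots, bot_fraction, seed=0):
    """Keep all non-bots and sample bots so they form `bot_fraction` of the result."""
    # Solve n_bots / (n_bots + n_non_bots) = bot_fraction for n_bots.
    n_bots = round(len(non_bots) * bot_fraction / (1 - bot_fraction))
    n_bots = min(n_bots, len(bots))  # cannot sample more bots than exist
    sampled = bots.sample(n=n_bots, random_state=seed)
    combined = pd.concat([non_bots, sampled], ignore_index=True)
    # Shuffle so bots and non-bots are interleaved.
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

non_bots = pd.DataFrame({"id": range(80), "bot": 0})
bots = pd.DataFrame({"id": range(100, 200), "bot": 1})

combined = sample_to_bot_fraction(non_bots, bots, bot_fraction=0.2)
print(len(combined), combined["bot"].mean())
```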

`split_dataset(self, data, proportions=[0.7, 0.15, 0.15], target='bot')`\
Splits the dataset into training, testing, and validation sets.

- **Parameters:**
  - `data` (DataFrame): The DataFrame to split.
  - `proportions` (list): Proportions of training, testing, and validation sets.
  - `target` (str): The name of the target feature.
- **Returns:**
  - `dict`: A dictionary containing 'X_train', 'X_test', 'X_val', 'y_train', 'y_test', and 'y_val' DataFrames/Series.
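
A stratified three-way split can be built from two calls to `train_test_split`, as in the sketch below. This shows one plausible way to obtain the documented return value and is not necessarily how `split_dataset` is implemented. The function `three_way_split`, its `seed` parameter, and the toy data are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(data, proportions=(0.7, 0.15, 0.15), target="bot", seed=0):
    """Split into train/test/validation sets, stratified on the target column."""
    train_p, test_p, val_p = proportions
    X, y = data.drop(columns=[target]), data[target]

    # First split: carve off the training set.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=train_p, stratify=y, random_state=seed
    )
    # Second split: divide the remainder between test and validation.
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, train_size=test_p / (test_p + val_p),
        stratify=y_rest, random_state=seed
    )
    return {"X_train": X_train, "X_test": X_test, "X_val": X_val,
            "y_train": y_train, "y_test": y_test, "y_val": y_val}

data = pd.DataFrame({"feature": range(200), "bot": [0, 1] * 100})
splits = three_way_split(data)
for key, value in splits.items():
    print(f"{key}: {len(value)}")
```

Stratifying both calls keeps the bot share roughly equal across the three subsets.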

## Example Usage

In [None]:
importer = ImportData()

# Read and sample data
data = importer.read_and_sample_data()
print("Sampled Data:")
print(data.head())

# Split dataset
split_data = importer.split_dataset(data)
print("\nSplit Data:")
for key, value in split_data.items():
    print(f"{key}: {len(value)}")



# FeatureSelection Class

**Description**:
This class is designed to perform feature selection on a given dataset for binary classification tasks. It supports three types of feature selection methods: correlation analysis, chi-squared test, and mutual information classifier. Additionally, it provides visualization capabilities such as pair plots and correlation heatmaps to assist in understanding the relationship between features and the target variable.

**Attributes**:
- `data`: DataFrame: The input dataset containing both features and the target variable.
- `X`: DataFrame: Features of the dataset.
- `y`: Series: Target variable of the dataset.
- `values`: DataFrame: Stores the feature importance values obtained from feature selection methods.
- `list_values`: List: Stores the names of selected features.

**Methods**:
1. `__init__(data)`: Constructor method that initializes the FeatureSelection object with the input dataset.

2. `select_features(type_selection)`: Method to select features based on the specified type of feature selection method. It returns a list of selected feature names.

3. `correlation()`: Method to perform correlation analysis and rank features based on their correlation with the target variable.

4. `chi2()`: Method to perform chi-squared test for feature selection and rank features based on their importance scores.

5. `mutual_classifier()`: Method to compute feature importance using mutual information classifier and rank features accordingly.

6. `pair_plot(num_feat)`: Method to generate a pair plot of selected features, optionally specifying the number of features to include.

7. `correlation_map(num_feat)`: Method to generate a correlation heatmap of selected features, optionally specifying the number of features to include.

**Parameters**:
- `type_selection`: str, default='correlation': Specifies the type of feature selection method to use ('correlation', 'chi2', or 'classifier').
- `num_feat`: {'all', int}, default='all': Specifies the number of features to include in pair plot and correlation heatmap. If 'all', all features are included; otherwise, an integer value determines the number of features.
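
Ranking by correlation with the target can be sketched in a few lines of pandas. The helper `rank_by_correlation`, the use of absolute Pearson correlation, and the toy columns are assumptions. The class's own `correlation()` method may rank differently (for example by signed correlation).

```python
import pandas as pd

def rank_by_correlation(data, target="bot"):
    """Rank features by absolute Pearson correlation with the target column."""
    correlations = data.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)

data = pd.DataFrame({
    "followers": [10, 20, 30, 40, 50, 60],
    "noise": [5, 1, 4, 2, 6, 3],
    "bot": [0, 0, 0, 1, 1, 1],
})

ranking = rank_by_correlation(data)
print(ranking)

# Keep the top-k feature names, analogous to a list of selected features.
top_features = list(ranking.index[:1])
```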

**Example usage**


In [None]:
# Create FeatureSelection object
fs = FeatureSelection(data)

# Perform feature selection using correlation analysis
selected_features = fs.select_features(type_selection='correlation')

# Generate pair plot
fs.pair_plot(num_feat=5)

# Generate correlation heatmap
fs.correlation_map(num_feat=10)


**Dependencies**:\
pandas: For data manipulation and handling.\
seaborn: For visualization of pair plots and correlation heatmaps.\
matplotlib.pyplot: For additional customization of visualizations.


# ModelTester Class

## Description:
The `ModelTester` class is designed to facilitate the testing and evaluation of machine learning models for binary classification tasks. It provides functionality for model initialization, parameter tuning via grid search, prediction generation, and model persistence. Additionally, it allows users to perform feature selection and specify the number of features to consider during model training and testing.

## Attributes:
- `X_train`: DataFrame: Training features.
- `X_test`: DataFrame: Test features.
- `X_val`: DataFrame: Validation features.
- `y_train`: Series: Training target variable.
- `y_test`: Series: Test target variable.
- `y_val`: Series: Validation target variable.
- `feature_importance`: List: Feature importance rankings.
- `models`: Dictionary: Stores trained models.

## Methods:

1. `__init__(data_dict, feature_importance)`: Constructor method that initializes the `ModelTester` object with data and feature importance rankings.

2. `load_models()`: Loads pre-trained models with parameters from the 'Parameters' folder.

3. `change_model_parameters(model_name, new_params)`: Modifies parameters of the specified model.

4. `save_current_parameters(model_name)`: Saves the current parameters of the specified model to the 'Parameters' folder.

5. `fit_all_models(num_features=None)`: Fits all models with the training data.

6. `grid_search(model_name, param_grid=None, scoring='f1', num_features=None, save_feature=False)`: Performs grid search to fine-tune parameters for the specified model.

7. `predict_model(model_name, num_features=None)`: Generates predictions (class labels and probabilities) for the chosen model using test and validation data.

## Parameters:

- `data_dict`: Dictionary containing training, test, and validation data (X_train, X_test, X_val, y_train, y_test, y_val).
- `feature_importance`: List of feature importance rankings.
- `model_name`: Name of the model (e.g., 'logistic_regression', 'knn', 'svm', 'decision_tree').
- `new_params`: Dictionary containing new parameter values for the specified model.
- `param_grid`: Dictionary specifying the range of parameters for grid search.
- `scoring`: Scoring metric for grid search optimization (default is 'f1').
- `num_features`: Number of features to consider in model training and testing.
- `save_feature`: Boolean indicating whether to save the best parameters obtained from grid search.
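
For orientation, a grid search scored on F1 with scikit-learn's `GridSearchCV` looks roughly like the sketch below. It stands in for what `grid_search('logistic_regression')` might do internally. The synthetic data, the parameter grid, and the choice of five folds are assumptions, not values taken from `ModelTester`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid over the regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # same default scoring name as the documented method
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_
print(best_params, round(search.best_score_, 3))
```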

## Example Usage:
```python
# Create ModelTester object
tester = ModelTester(data_dict, feature_importance)

# Perform grid search for logistic regression model
best_params = tester.grid_search('logistic_regression')

# Generate predictions for logistic regression model
predictions = tester.predict_model('logistic_regression')
```

## Dependencies:
os: For file operations and directory manipulation.\
joblib: For model persistence.\
GridSearchCV: From sklearn.model_selection for parameter tuning.\
classification_report: From sklearn.metrics for generating classification reports.

# TestEnvironment Class

## Description:
The `TestEnvironment` class facilitates the setup and execution of machine learning model testing in a controlled environment. It allows users to define various parameters related to dataset configuration, feature selection, model selection, hyperparameter tuning, and result saving. Additionally, it integrates data importation, feature selection, model testing, and result saving functionalities into a cohesive workflow.

## Attributes:
- `DATASET`: Name of the dataset.
- `BOT_FOLDERS`: List of folders containing bot data.
- `BOT_RATIO`: Ratio of bot data to total data.
- `MERGED_DATASET`: Indicates whether the dataset is merged.
- `TYPE_SELECTION`: Type of feature selection method.
- `TRAIN_RATE`: Proportion of data used for training.
- `TEST_RATE`: Proportion of data used for testing.
- `VAL_RATE`: Proportion of data used for validation.
- `MODEL`: Type of model to be tested.
- `FEATURES`: Number of features to consider during testing.
- `MODEL_P`: Dictionary of model parameters.
- `GRID_SEARCH`: Boolean indicating whether to perform grid search.

## Methods:

1. `__init__(DATASET, BOT_FOLDERS, BOT_RATIO, MERGED_DATASET, TYPE_SELECTION, TRAIN_RATE, TEST_RATE, VAL_RATE, MODEL, FEATURES, MODEL_P, GRID_SEARCH)`: Constructor method that initializes the `TestEnvironment` object with specified parameters.

2. `save_results(model_parametres, test_metrics, val_metrics, MODEL)`: Saves the testing results to a CSV file.

3. `run_tests()`: Runs tests for the specified model(s) and returns the results.

## Parameters:

- `DATASET`: Name of the dataset to be tested.
- `BOT_FOLDERS`: List of folders containing bot data.
- `BOT_RATIO`: Ratio of bot data to total data.
- `MERGED_DATASET`: Boolean indicating whether the dataset is merged.
- `TYPE_SELECTION`: Type of feature selection method ('correlation', 'chi2', 'classifier').
- `TRAIN_RATE`: Proportion of data used for training.
- `TEST_RATE`: Proportion of data used for testing.
- `VAL_RATE`: Proportion of data used for validation.
- `MODEL`: Type of model to be tested ('all' or specific model name).
- `FEATURES`: Number of features to consider during testing.
- `MODEL_P`: Dictionary of model parameters for hyperparameter tuning.
- `GRID_SEARCH`: Boolean indicating whether to perform grid search for hyperparameter tuning.

## Example Usage:
```python
# Create TestEnvironment object
env = TestEnvironment(DATASET, BOT_FOLDERS, BOT_RATIO, MERGED_DATASET, TYPE_SELECTION, TRAIN_RATE, TEST_RATE, VAL_RATE, MODEL, FEATURES, MODEL_P, GRID_SEARCH)

# Run tests
results = env.run_tests()
```

## Dependencies:
- `os`: For file operations and directory manipulation.
- `pandas`: For data manipulation and handling.
- `ImportData`: Custom class for importing data.
- `Evaluate`: Custom class for model evaluation.
- `FeatureSelection`: Custom class for feature selection.
- `ModelTester`: Custom class for model testing.