# Capstone Project: Environment Setup and Instructions

This repository contains Jupyter notebooks and configurations for processing and analyzing 3W datasets. To ensure smooth execution, Conda environments are provided for each set of notebooks. Follow the instructions below to set up the environments and associate them with the correct files.

## **Environments Overview**

1. **`3w_env.yml`**:
   - **Location**: `Configurations/3w_env.yml`
   - **Purpose**: Supports the `Data/3W_Data_Extraction.ipynb` notebook.

2. **`pyspark_env.yml`**:
   - **Location**: `Configurations/pyspark_env.yml`
   - **Purpose**: Supports the following notebooks:
     - `EDA/3W_Real.ipynb`
     - `Cleaning & Preparation/3W_Real_Cleaning_Preparation.ipynb`
     - `ML/Attempts/3W_PySpark_MLlib_Binary_by_label.ipynb`
     - `ML/Attempts/3W_PySpark_MLlib_MultiClassification.ipynb`
     - `ML/3W_PySpark_MLlib_Stacked_Model.ipynb`

## **Setup Instructions**

### 1. Clone the Repositories 

Clone both repositories to your local machine, placing them side by side in the same parent directory:

Capstone: https://github.com/Abdulatif-AZ/CapStone  
3W Petrobras: https://github.com/petrobras/3W

```bash
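# Run both clone commands from the same parent directory
git clone https://github.com/Abdulatif-AZ/CapStone
git clone https://github.com/petrobras/3W
cd CapStone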
```
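
Because both repositories need to sit in the same parent directory, a quick check after cloning can catch a misplaced folder. The sketch below assumes the default folder names that `git clone` derives from the two URLs (`CapStone` and `3W`); the function name `check_sibling_repos` is hypothetical and not part of either repository.

```bash
# Hypothetical helper: confirm both clones sit side by side in one parent directory.
# Assumes the default folder names produced by `git clone` (CapStone and 3W).
check_sibling_repos() {
  local parent="${1:-.}"
  local missing=0
  local repo
  for repo in CapStone 3W; do
    if [ -d "$parent/$repo" ]; then
      echo "found: $parent/$repo"
    else
      echo "missing: $parent/$repo"
      missing=1
    fi
  done
  return "$missing"
}

# Example: run from inside the CapStone clone to inspect its parent directory.
# check_sibling_repos ..
```

If you cloned into custom folder names, adjust the two names in the loop accordingly.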

### 2. Install Conda (if not already installed)

Download and install Conda from the [official Conda website](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).

### 3. Install Java for PySpark

The `pyspark_env` environment requires Java. Install and verify it as follows:

1. Download Java from the [Java website](https://www.java.com/en/download/).  
   - Make sure to download the version appropriate for your operating system.
2. Verify the Java installation:

   ```bash
   java -version
   ```
   This should display the installed Java version.
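
The version banner can also be reduced to a single major version number, which is easier to compare against the Java versions supported by the PySpark release pinned in `pyspark_env.yml`. The following is a minimal sketch; `java_major_version` is a hypothetical helper name, and the supported range itself is not stated here, so check the documentation for your PySpark release.

```bash
# Hypothetical helper: extract the major Java version from `java -version` output.
# Legacy releases report "1.8.0_281" (major version 8); newer ones report "11.0.19" or "17".
java_major_version() {
  local banner="$1"
  local version
  # Take the quoted version string from the first banner line that contains one.
  version="$(printf '%s\n' "$banner" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  [ -n "$version" ] || return 1
  case "$version" in
    1.*) version="${version#1.}" ;;   # strip the legacy "1." prefix
  esac
  # Keep only the leading digits (drops ".0.19", "_281", "-ea", and so on).
  printf '%s\n' "${version%%[!0-9]*}"
}

# Example (java -version writes to stderr, hence the 2>&1):
# java_major_version "$(java -version 2>&1)"
```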

### 4. Create Conda Environments

#### a. **Setting Up `3w_env`**

This environment is required for the `Data/3W_Data_Extraction.ipynb` notebook.

1. Navigate to the `Configurations` folder:
   ```bash
   cd Configurations
   ```
2. Create the environment from the `3w_env.yml` file:
   ```bash
   conda env create -f 3w_env.yml
   ```
3. Activate the environment:
   ```bash
   conda activate 3w_env
   ```



#### b. **Setting Up `pyspark_env`**

This environment is required for notebooks in the `EDA`, `Cleaning & Preparation`, and `ML` directories.

1. Navigate to the `Configurations` folder (skip this step if you are already in it after setting up `3w_env`):
   ```bash
   cd Configurations
   ```
2. Create the environment from the `pyspark_env.yml` file:
   ```bash
   conda env create -f pyspark_env.yml
   ```
3. Activate the environment:
   ```bash
   conda activate pyspark_env
   ```
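
Running `conda env create` a second time fails when the environment already exists, so it can help to check first. The sketch below wraps the commands above in two hypothetical helpers, `env_exists` and `create_env_if_missing`. It assumes that each YAML file's `name:` field matches its file name without the `.yml` suffix, which is consistent with the `conda activate` commands in this guide but is not verified here.

```bash
# Hypothetical helper: succeed if a Conda environment with the given name already exists.
env_exists() {
  # Header lines in `conda env list` start with "#", so they never match a name.
  conda env list | awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Hypothetical helper: create the environment from its YAML file only when it is absent.
# Assumes the file's `name:` field matches the file name without the .yml suffix.
create_env_if_missing() {
  local yml="$1"
  local name
  name="$(basename "$yml" .yml)"
  if env_exists "$name"; then
    echo "$name already exists; skipping"
  else
    conda env create -f "$yml"
  fi
}

# Example, run from the Configurations folder:
# create_env_if_missing 3w_env.yml
# create_env_if_missing pyspark_env.yml
```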

## **Notebook-Specific Instructions**

Below is the mapping of environments to notebooks:

| Notebook                                                                                  | Required Environment  |
|-------------------------------------------------------------------------------------------|-----------------------|
| `Data/3W_Data_Extraction.ipynb`                                                           | `3w_env`              |
| `EDA/3W_Real.ipynb`                                                                       | `pyspark_env`         |
| `Cleaning & Preparation/3W_Real_Cleaning_Preparation.ipynb`                               | `pyspark_env`         |
| `ML/Attempts/3W_PySpark_MLlib_Binary_by_label.ipynb`                                      | `pyspark_env`         |
| `ML/Attempts/3W_PySpark_MLlib_MultiClassification.ipynb`                                  | `pyspark_env`         |
| `ML/3W_PySpark_MLlib_Stacked_Model.ipynb`                                                 | `pyspark_env`         |
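
The same mapping can be expressed as a small lookup, which is handy when scripting notebook runs. This is only a sketch: `env_for_notebook` is a hypothetical helper, and it matches on the top-level folder of a path given relative to the repository root.

```bash
# Hypothetical helper: print the Conda environment a notebook path needs,
# based on its top-level folder. Paths are relative to the repository root.
env_for_notebook() {
  case "$1" in
    Data/*) echo "3w_env" ;;
    EDA/*|"Cleaning & Preparation"/*|ML/*) echo "pyspark_env" ;;
    *) echo "unknown notebook: $1" >&2; return 1 ;;
  esac
}

# Example:
# conda activate "$(env_for_notebook "ML/3W_PySpark_MLlib_Stacked_Model.ipynb")"
```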



## **Verifying Setup**

After activating the appropriate environment, verify that the required packages are installed:
```bash
conda list
```

If any required package is missing, install it using:
```bash
conda install <package_name>
```
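
Scanning the full `conda list` output by eye is error-prone, so a scripted check for one package can be useful. The sketch below defines a hypothetical helper, `has_package`, that looks for an exact package name in a named environment. The `pyspark` package in the example is an assumption based on the purpose of `pyspark_env`; consult the YAML files for the actual package lists.

```bash
# Hypothetical helper: succeed if `conda list` reports the package in the named environment.
has_package() {
  local env="$1" pkg="$2"
  # Header lines start with "#", so they never match; compare the first column exactly.
  conda list -n "$env" | awk -v pkg="$pkg" '$1 == pkg { found = 1 } END { exit !found }'
}

# Example (pyspark is an assumed package name):
# has_package pyspark_env pyspark && echo "pyspark is installed"
```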

## **Running the Notebooks**

1. Activate the correct environment based on the notebook:
   - For `3w_env`: `conda activate 3w_env`
   - For `pyspark_env`: `conda activate pyspark_env`
2. Start Jupyter Notebook from the repository root so that every notebook folder is visible:
   ```bash
   jupyter notebook
   ```
3. Navigate to the required notebook and run it.

## **Troubleshooting**

### Environment Issues

- **Environment Not Found**: Ensure the correct Conda installation is active. If multiple Conda installations exist, specify the path to the Conda binary explicitly, e.g.:
  ```bash
  /path/to/conda/bin/conda env create -f <environment_file.yml>
  ```
- **Missing Packages**: Update the environment from its YAML file, or install the missing package directly:
  ```bash
  conda env update -f <environment_file.yml>
  conda install <package_name>
  ```

### Common Errors

- **Java Not Found**: Ensure Java is installed and that the `JAVA_HOME` environment variable points to the Java installation directory.
- **Dependency Issues**: Update Conda itself and then the packages in the active environment, which often resolves dependency conflicts:
  ```bash
  conda update conda
  conda update --all
  ```
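
For the **Java Not Found** case, the following sketch checks whether `JAVA_HOME` is set and points at a directory containing a `bin/java` executable. `check_java_home` is a hypothetical helper, and the check assumes a Unix-like layout; on Windows the executable is `bin\java.exe` and the test would need adjusting.

```bash
# Hypothetical helper: report whether JAVA_HOME points at a usable Java installation.
# Assumes a Unix-like layout with the launcher at $JAVA_HOME/bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable found at $JAVA_HOME/bin/java"
    return 1
  fi
  echo "JAVA_HOME looks valid: $JAVA_HOME"
}

# Example:
# check_java_home || echo "fix JAVA_HOME before starting PySpark"
```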