# IE6600 Project

## Group Members

- Haoyuan Han
- Lingxuan Ye
- Mingxiao Liu

## Notice

All files under

- [_data](./_data/)
- [_ngrok](./_ngrok/)
- [venv](./venv/)

are retrieved and/or deployed automatically by [deploy.sh](./scripts/deploy.sh). Please make sure that all scripts are run from their own directories.

## Environment Requirements

This project is deployed locally. Please make sure the software listed below is installed on your machine:

- Python interpreter (3.3 or above), with its path in the environment variable `$PATH`.

- Git with Git-Bash.

That’s all.
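
The requirements above can be checked before deploying. The following is a minimal sketch, not part of the project scripts; `check_environment` and `MIN_PYTHON` are hypothetical names, and the executable name `python` is an assumption (on some systems it is `python3`).

```python
import shutil
import sys

MIN_PYTHON = (3, 3)  # minimum version stated in this README


def check_environment(required_tools=("python", "git")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append("%s was not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result means both tools were found on `$PATH` and the interpreter is recent enough.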

## Project Initialization

1. Execute [clone.sh](./scripts/utils/clone.sh) in any directory you like. You can find this file at [remote/clone.sh](https://raw.githubusercontent.com/Lingxuan-Ye/IE6600_project/main/scripts/utils/clone.sh), or just run the following command in your terminal:

    ```
    curl -fsSLO https://raw.githubusercontent.com/Lingxuan-Ye/IE6600_project/main/scripts/utils/clone.sh && bash clone.sh
    ```

2. Change directory from [project root](./) to [scripts](./scripts/), then execute [deploy.sh](./scripts/deploy.sh).

    - Please do not exit the process with `CTRL` + `C` unless you know exactly what you are doing.

    - Please set your authtoken unless it is already in your global settings. An authtoken registered here applies **ONLY** to this project. If you have not set one, ngrok.exe will fail, and because I have not yet figured out how to redirect its stderr, it will exit without any message.

3. Switch back to the project root, then execute [run.sh](./run.sh). The public and local URLs will be printed in your terminal.

    - Please note that the **ngrok** process runs in its own terminal interface; make sure it is properly terminated when you stop running the project.
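
Step 2 mentions that ngrok can fail silently because its stderr is not redirected. As a rough illustration only (deploy.sh is a shell script and this is not how it works), the sketch below shows one way to capture a child process's stderr in Python 3.5 or newer. `run_and_capture` is a hypothetical helper, and the failing child is a stand-in command, not the real ngrok invocation.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command and return (exit_code, stderr_text)."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,   # collect stderr instead of losing it
        universal_newlines=True,  # decode bytes to str
    )
    return completed.returncode, completed.stderr


# Stand-in for a failing child process (NOT the real ngrok invocation).
fake_failure = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('authtoken missing'); sys.exit(1)",
]
code, err = run_and_capture(fake_failure)
if code != 0:
    print("child exited with %d: %s" % (code, err))
```

With the captured text available, a wrapper could print a hint about the missing authtoken instead of exiting silently.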

## Data Cleaning and Preprocessing

### Data Retrieving

Create an `IO` instance.

```python
from source import IO
```

```python
io = IO(data_dir='./_data/', as_NaN=('?',))
```

```python
io.raw
```
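
The `as_NaN=('?',)` argument suggests that the raw files mark missing values with `?`. How `IO` handles this internally is not shown here; the sketch below only illustrates the equivalent idea in plain pandas, assuming `as_NaN` behaves like the `na_values` parameter of `pd.read_csv`. The inline CSV text is made up for the example.

```python
import io as stdlib_io  # aliased so it does not clash with the `io` instance above

import pandas as pd

# Tiny stand-in for a raw file that marks missing values with '?'.
raw_text = "state,population,LandArea\n1,1000,?\n2,?,3.5\n"

# na_values tells pandas which extra strings to read as NaN.
frame = pd.read_csv(stdlib_io.StringIO(raw_text), na_values=["?"])
print(frame.isna().sum())
```

Without `na_values`, the affected columns would be read as strings and the `?` entries would not count as missing.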

### Preprocessing

```python
io.preprocess()
io.data
```

The dataset contains so many invalid values that too few rows remain after dropping every row with a `NaN` value.

Most columns of this dataset are irrelevant to our research, and rows whose `NaN` values occur only in those columns should not be dropped. We should therefore filter the data first and then preprocess it.
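
The effect of the order of operations can be seen on a toy frame. This is a sketch with made-up values, assuming that preprocessing drops rows containing `NaN`; `unused_column` is a hypothetical stand-in for the columns we do not need.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state": [1, 2, 3, 4],
    "medIncome": [30000.0, 45000.0, np.nan, 52000.0],
    # Hypothetical column that is mostly missing and not needed for the analysis.
    "unused_column": [np.nan, np.nan, np.nan, 1.0],
})

# Dropping first removes every row with a NaN in ANY column.
drop_then_filter = toy.dropna()[["state", "medIncome"]]
# Filtering first only removes rows with a NaN in the columns we keep.
filter_then_drop = toy[["state", "medIncome"]].dropna()

print(len(drop_then_filter))  # 1
print(len(filter_then_drop))  # 3
```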

### Data Filtering

Inspect data:

```python
io.inspection
```

After discussion, our group picked the columns listed below:

```python
GENERAL = ['state', 'population', 'LandArea']
ECO = ['medIncome', 'PctPopUnderPov', 'PctUnemployed']
# RACE = ['racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp']
HOMELESS = ['NumInShelters', 'NumStreet']
SECURITY = ['LemasPctOfficDrugUn', 'ViolentCrimesPerPop']

COLUMNS = [*GENERAL, *ECO, *HOMELESS, *SECURITY]
```

Filter the data with the chosen columns:

```python
filtered = io.raw[COLUMNS]
```

Further, for more explicit interpretation, we rename the column `state` to `fips` and insert the `state_name` and `state_abbr` columns.

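```python
import pandas as pd

# Load the FIPS lookup table, keeping only the columns we need.
FIPS = pd.read_csv(
    './_data/state_fips_master.csv'
)[['fips', 'state_name', 'state_abbr']]

# Rename 'state' to 'fips', then attach state names and abbreviations.
io.raw = pd.merge(FIPS, filtered.rename({'state': 'fips'}, axis=1), on='fips')
```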
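
To make the merge behavior concrete, here is a small sketch on toy data (the frames and values are invented for illustration). It shows that the default inner join silently drops rows whose code is absent from the lookup table. The `validate` argument is an optional safeguard that is not used in the project code above.

```python
import pandas as pd

# Hypothetical two-state excerpt of the FIPS lookup table.
fips_lookup = pd.DataFrame({
    "fips": [1, 2],
    "state_name": ["Alabama", "Alaska"],
    "state_abbr": ["AL", "AK"],
})
# Toy stand-in for the filtered data; state 99 has no match in the lookup.
toy_filtered = pd.DataFrame({
    "state": [1, 1, 2, 99],
    "population": [1000, 2000, 3000, 4000],
})

merged = pd.merge(
    fips_lookup,
    toy_filtered.rename(columns={"state": "fips"}),
    on="fips",
    how="inner",             # pandas' default; unmatched rows are dropped
    validate="one_to_many",  # fail loudly if a fips code repeats in the lookup
)
print(merged)
```

If unmatched rows should be kept for inspection, `how="right"` together with `indicator=True` is one way to surface them.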

### Preprocessing Again

```python
io.preprocess()
io.data.describe()
```

Due to the large amount of missing data in the `county` and `community` columns of the original dataset, we will **ONLY** consider state-level visualization. Therefore, we imported the [GeoJSON for states](https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json) from the GitHub repository [PublicaMundi/MappingAPI](https://github.com/PublicaMundi/MappingAPI/blob/master/data/geojson/us-states.json).
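
A state-level view needs one row per state. The sketch below shows one possible aggregation on toy data; it is not necessarily the one used in this project. Taking an unweighted mean of `ViolentCrimesPerPop` is a simplification, and the `geo_id` column assumes that the GeoJSON features are keyed by two-digit FIPS strings, which should be verified against the file.

```python
import pandas as pd

# Toy community-level rows; column names follow the list chosen above.
toy = pd.DataFrame({
    "fips": [1, 1, 6],
    "state_abbr": ["AL", "AL", "CA"],
    "population": [1000, 3000, 5000],
    "ViolentCrimesPerPop": [0.2, 0.4, 0.3],
})

# Collapse to one row per state: sum the counts, average the rates.
by_state = (
    toy.groupby(["fips", "state_abbr"], as_index=False)
       .agg(population=("population", "sum"),
            ViolentCrimesPerPop=("ViolentCrimesPerPop", "mean"))
)
# Assumption: the GeoJSON features are keyed by two-digit FIPS strings.
by_state["geo_id"] = by_state["fips"].astype(str).str.zfill(2)
print(by_state)
```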