# 1) Setup
## Modules

In [30]:
# Libraries
import pandas as pd
import pyarrow.parquet as pq

## Data

In [35]:
# Import data (Parquet / Fabien?)
data_business = pq.read_table('data/ATML2024_businesses.parquet')
df_business = data_business.to_pandas()

data_Train_reviews = pq.read_table('data/ATML2024_reviews_train.parquet')
df_Train_reviews = data_Train_reviews.to_pandas()

data_users = pq.read_table('data/ATML2024_users.parquet')
df_users = data_users.to_pandas()

In [40]:
# Get a brief overview of the datasets

# USER DATASET
print("Dataset Schema:")
print(data_users.schema)
print("\nSummary Statistics:")
print(df_users.describe())

# REVIEWS DATASET
print("Dataset Schema:")
print(data_Train_reviews.schema)
print("\nSummary Statistics:")
print(df_Train_reviews.describe())

# BUSINESS DATASET
print("Dataset Schema:")
print(data_business.schema)
print("\nSummary Statistics:")
print(df_business.describe())

Dataset Schema:
user_id: string
name: string
user_since: string
useful: double
funny: double
cool: double
premium_account: string
friends: double
fans: double
compliment_hot: double
compliment_more: double
compliment_profile: double
compliment_cute: double
compliment_list: double
compliment_note: double
compliment_plain: double
compliment_cool: double
compliment_funny: double
compliment_writer: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2615

Summary Statistics:
              useful          funny           cool        friends  \
count  747468.000000  747468.000000  747468.000000  747468.000000   
mean       74.840044      31.100846      45.027212      61.342055   
std       965.774055     605.662158     849.247796     178.325034   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         5.000000       1.000000       1.000000      

### Transform data
Possibilities:
- Log-transform variables that are highly skewed.
- Bin continuous variables.
- Simplify categories: aggregate to higher-level categories when it makes sense.
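
The three options above can be sketched as follows. This is only an illustration on synthetic values: the numbers, the bin labels, and the category mapping are assumptions, not taken from the datasets loaded earlier.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed counts standing in for a column such as `useful`.
counts = pd.Series([0, 1, 1, 5, 12, 40, 300, 9000], name="useful")

# log1p computes log(1 + x), so zero counts stay defined and map to 0.
log_counts = np.log1p(counts)

# Quantile binning: four bins cut at the quartiles of the data.
# The labels are illustrative only.
bins = pd.qcut(counts, q=4, labels=["low", "mid", "high", "very_high"])

# Category simplification: map detailed labels to a coarser, hand-made grouping.
# Both the detailed labels and the mapping are hypothetical.
detailed = pd.Series(["Sushi Bars", "Pizza", "Hair Salons", "Barbers"])
coarse_map = {
    "Sushi Bars": "Food",
    "Pizza": "Food",
    "Hair Salons": "Beauty",
    "Barbers": "Beauty",
}
coarse = detailed.map(coarse_map)

print(pd.DataFrame({"raw": counts, "log1p": log_counts, "bin": bins}))
print(coarse.value_counts())
```

`log1p` is used instead of a plain logarithm because count columns usually contain zeros.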

## Data quality assessment and profiling

1. Categorical: 
	- count
	- count distinct
	- assess unique values
2. Numerical: 
	- count
	- min
	- max

Do we have missing data?
Are there variables that are numerical but really should be categorical?
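
One possible way to collect these indicators in a single table is sketched below. The helper `profile`, the toy frame, and the distinct-value threshold are hypothetical; only the column names `premium_account` and `fans` come from the users schema printed above.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with basic data-quality indicators."""
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "count": df.count(),         # non-missing values
        "missing": df.isna().sum(),  # missing values
        "distinct": df.nunique(),    # distinct non-missing values
    })
    report["numeric"] = report.index.isin(numeric.columns)
    # min/max are only defined for numeric columns; other rows stay NaN.
    report["min"] = numeric.min()
    report["max"] = numeric.max()
    return report

# Toy frame with made-up values; `postcode` is a hypothetical column.
toy = pd.DataFrame({
    "premium_account": ["yes", "no", None, "no"],
    "fans": [0.0, 3.0, 1.0, 3.0],
    "postcode": [8001, 8001, 3000, 3000],
})
report = profile(toy)
print(report)

# Numeric columns with very few distinct values may really be categorical.
# The threshold of 2 is arbitrary and only suits this toy example.
suspects = report[report["numeric"] & (report["distinct"] <= 2)].index.tolist()
print(suspects)
```

For categorical columns, `df[col].value_counts(dropna=False)` is a quick way to assess the unique values themselves.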

# 2) Exploratory Data Analysis

## 1. Exploring each individual variable

Quantify:

- _Location_:
	- mean
	- median
	- mode
	- interquartile mean
- _Spread_:
	- standard deviation
	- variance
	- range
	- interquartile range
- _Shape_:
	- skewness
	- kurtosis
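
A minimal sketch of how these quantities could be computed for one column with pandas. The helper name `describe_shape` is hypothetical, and the interquartile mean is taken here in its simplest form (the mean of the values lying between the first and third quartile, inclusive); other definitions weight the boundary observations.

```python
import pandas as pd

def describe_shape(s: pd.Series) -> pd.Series:
    """Location, spread, and shape statistics for one numeric column."""
    q1, q3 = s.quantile([0.25, 0.75])
    middle = s[(s >= q1) & (s <= q3)]  # values inside the interquartile range
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().iloc[0],      # first mode if there are ties
        "interquartile_mean": middle.mean(),
        "std": s.std(),
        "variance": s.var(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "skewness": s.skew(),
        "kurtosis": s.kurt(),          # excess kurtosis (normal distribution = 0)
    })

# Synthetic values with one large observation to make the skew visible.
values = pd.Series([1, 2, 2, 3, 4, 5, 100], dtype=float)
stats = describe_shape(values)
print(stats)
```

A mean far above the median, as in this toy series, is a quick hint of right skew.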

For time series:
- Plot summary statistics over time

For panel data:
- Plot cross-sectional summary statistics over time
- Plot time-series statistics across the population
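
The two panel-data views can be sketched with `groupby`. The frame below is synthetic, and the column names `subject`, `period`, and `value` are placeholders rather than columns of the datasets above.

```python
import pandas as pd

# Synthetic panel: two subjects observed over three periods.
panel = pd.DataFrame({
    "subject": ["a", "a", "a", "b", "b", "b"],
    "period": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"] * 2),
    "value": [1.0, 2.0, 3.0, 3.0, 6.0, 9.0],
})

# Cross-sectional statistics over time: one row per period,
# summarising all subjects observed in that period.
cross_section = panel.groupby("period")["value"].agg(["mean", "std"])

# Time-series statistics across the population: one row per subject,
# summarising that subject's whole history.
per_subject = panel.groupby("subject")["value"].agg(["mean", "std"])

print(cross_section)
print(per_subject)

# With matplotlib installed, cross_section["mean"].plot() would draw the
# first view as a line over time; the call is left out to keep this runnable.
```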

#### Questions
- What does each field in the data look like?
    - Is the distribution skewed? Bimodal?
    - Are there outliers? Are they feasible?
    - Are there discontinuities?
- Are the typical modeling assumptions valid? Is each variable:
    - Gaussian?
    - Independent and identically distributed?
    - Unimodal?
    - Able to take negative values?
    - Generated by a stationary and isotropic process (time series)?
    - Independent across subjects (panel data)?
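
Some of these questions can be given a first, rough answer numerically. The sketch below flags candidate outliers with Tukey's IQR rule and checks whether negative values occur. The helper `iqr_outliers` and the data are hypothetical, and the multiplier of 1.5 is the conventional default, not a setting from this notebook.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Synthetic values with one extreme observation.
values = pd.Series([1, 2, 2, 3, 4, 5, 100], dtype=float)
mask = iqr_outliers(values)

print(values[mask])          # candidate outliers, to be judged for feasibility
print((values < 0).any())    # does the variable ever go negative?
print(values.skew())         # sign and size of the skew
```

A flagged value is only a candidate: whether it is feasible still has to be decided from domain knowledge.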

## 2. Exploring the relationship between each variable and the target

How does each field interact with the target?

Assess each relationship’s:
- Linearity
- Direction
- Rough size
- Strength

Methods:
- Bivariate visualizations
- Calculate correlation
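
As a rough illustration of the correlation step, the sketch below compares Pearson correlation (strength of a linear relationship) with Spearman correlation (strength of a monotone relationship), computed here as the Pearson correlation of the ranks. The frame and the column names `x`, `decreasing`, and `target` are made up.

```python
import pandas as pd

# Synthetic data: the target is a cubic (monotone but non-linear) function of x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "decreasing": [5, 4, 3, 2, 1],
    "target": [1, 8, 27, 64, 125],
})

# Pearson: direction and strength of the linear relationship with the target.
pearson = df.corr(method="pearson")["target"].drop("target")

# Spearman: Pearson correlation applied to the ranks of each column.
spearman = df.rank().corr(method="pearson")["target"].drop("target")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

A Spearman value clearly larger in magnitude than the Pearson value suggests a monotone but non-linear relationship, which is worth confirming with a scatter plot.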

## 3. Assessing interactions between variables

How do the variables interact with each other?
- Bivariate visualizations
- Correlation matrices
- Compare summary statistics of variable x for different categories of y
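
A minimal sketch of the last two items, assuming a frame with one categorical and two numeric columns. The column names are borrowed from the users schema above, but the values are invented.

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "premium_account": ["yes", "no", "yes", "no"],
    "fans": [10.0, 1.0, 20.0, 3.0],
    "friends": [100.0, 5.0, 150.0, 15.0],
})

# Correlation matrix of the numeric variables.
corr_matrix = df[["fans", "friends"]].corr()

# Summary statistics of a numeric variable for each category of another variable.
by_group = df.groupby("premium_account")["fans"].agg(["mean", "median", "count"])

print(corr_matrix)
print(by_group)
```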


## 4. Exploring data across many dimensions

- Categorical:
    - Parallel coordinates
- Continuous:
    - Principal component analysis
    - Clustering
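
For the continuous case, principal component analysis can be sketched with NumPy alone via the singular value decomposition of the centered data. The function name `pca` and the synthetic matrix are assumptions. The columns are only centered here; with variables on very different scales, standardizing them first is usually advisable.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data matrix (rows = observations)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    scores = centered @ components.T          # data projected onto the components
    explained_ratio = S**2 / np.sum(S**2)     # share of total variance per component
    return scores, components, explained_ratio[:n_components]

# Synthetic data: two perfectly correlated columns plus a little independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x, 0.01 * rng.normal(size=200)])

scores, components, ratio = pca(X, n_components=2)
print(scores.shape)
print(ratio)
```

When the first component explains almost all of the variance, as in this toy matrix, the variables are close to redundant.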