# MCR DAS WS2024 Homework 1

### Specifics:

This notebook was produced on a machine running Ubuntu 22.04 with Python 3.12.1.

For the best experience, run the notebook in a virtual environment rather than directly against the system Python.

The following script installs the necessary dependencies (the contents of `requirements.txt` are listed in Appendix A):

```sh
pip install setuptools && \
    pip install numpy && \
    pip install pandas && \
    pip install tqdm && \
    pip install joblib && \
    pip install category_encoders && \
    pip install polars && \
    pip install -r requirements.txt --no-build-isolation
```

If you prefer `uv` as a package manager, use this slightly adjusted script instead:

```sh
uv pip install setuptools && \
    uv pip install numpy && \
    uv pip install pandas && \
    uv pip install tqdm && \
    uv pip install joblib && \
    uv pip install category_encoders && \
    uv pip install polars && \
    uv pip install mrmr-selection && \
    uv pip install -r requirements.txt --no-build-isolation
```

### Disclaimer

I have my own pipelines for data science tasks, so I adapted them to this assignment.
As a direct consequence, I have not used pandas or numpy in this notebook.

Instead, I have used [polars](https://pola.rs/), which is a fast, in-memory dataframe library that is more suitable for my needs.

## Project Scope

The aim is to estimate the doctors' fees for a given dataset.

## Task 1: Data Analysis

#### A) Import the data from the “doctors_fee.csv” file into a dataframe.

In [44]:
# Step 0: Imports

import polars as pl
from polars import read_csv, scan_csv
import altair as alt

In [45]:
# Step 1: Import and format the data correctly

df = read_csv(
    "doctors_fees-v2.csv",
    separator=";",  # the separator is a semicolon
    decimal_comma=True,  # a comma was used to separate decimals
    schema={
        "ID": pl.UInt32,
        "Age": pl.UInt8,
        "Sex": pl.Categorical,  # I believe sex should be treated as categorical
        "BMI": pl.Float32,
        "Children": pl.UInt8,
        "Smoker": pl.String,
        "Region": pl.Categorical,
        "Charges": pl.Float64,
    },
).with_columns(pl.col("Smoker") == "yes")  # Match the string "yes" to the boolean True

#### B) How many rows and how many columns does the data frame have?

In [46]:
df.shape

(1338, 8)

##### Answer:

The dataframe has 1338 rows and 8 columns.

#### C) Delete the column id from the dataframe.

In [47]:
ids = df.drop_in_place(
    "ID"
)  # Keeping a reference to the ID column, while dropping it from the dataframe

#### D) Change the column names to lower case.

In [48]:
df = df.rename(str.lower)  # rename accepts a callable, so str.lower is applied to every column name

#### E) Change the column names “sex” to “gender”.

In [49]:
df = df.rename({"sex": "gender"})

#### F) In which columns are there how many missing values?


In [50]:
df.null_count()

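| age | gender | bmi | children | smoker | region | charges |
|-----|--------|-----|----------|--------|--------|---------|
| u32 | u32    | u32 | u32      | u32    | u32    | u32     |
| 0   | 0      | 0   | 0        | 1      | 0      | 0       |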


##### Answer:

There is 1 missing value in the `smoker` column.

#### G) If there are missing values, delete the corresponding rows from the dataframe.


In [51]:
df = df.drop_nulls()
df.shape  # Double check to see if exactly 1 row was purged

(1337, 7)

#### H) Replace „female“ with 0, and „Male“ with 1.

In [52]:
df = df.with_columns(
    (pl.col("gender") == "male").cast(pl.UInt8)
)  # The comparison is True for "male", so the cast yields 1 for male and 0 for female

#### I) Have you found incorrect values (which ones)?

I checked the summary statistics of the dataframe:

In [53]:
df.describe()

| statistic  | age       | gender   | bmi       | children | smoker   | region | charges      |
|------------|-----------|----------|-----------|----------|----------|--------|--------------|
| str        | f64       | f64      | f64       | f64      | f64      | str    | f64          |
| count      | 1337.0    | 1337.0   | 1337.0    | 1337.0   | 1337.0   | 1337   | 1337.0       |
| null_count | 0.0       | 0.0      | 0.0       | 0.0      | 0.0      | 0      | 0.0          |
| mean       | 39.198953 | 0.504862 | 33.086456 | 1.147345 | 0.204936 |        | 13246.999375 |
| std        | 14.052113 | 0.500163 | 88.718895 | 2.306586 |          |        | 12117.303298 |
| min        | 18.0      | 0.0      | 15.96     | 0.0      | 0.0      |        | -7151.092    |
| 25%        | 27.0      | 0.0      | 26.315001 | 0.0      |          |        | 4719.73655   |
| 50%        | 39.0      | 1.0      | 30.4      | 1.0      |          |        | 9377.9047    |
| 75%        | 51.0      | 1.0      | 34.700001 | 2.0      |          |        | 16586.49771  |
| max        | 64.0      | 1.0      | 3267.0    | 73.0     | 1.0      |        | 63770.42801  |


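The summary shows at least one clear outlier in the `bmi` column: the maximum is 3267.0, far out of proportion to the 75% quantile of about 34.7. The maximum of 73 in `children` and the negative minimum of -7151.092 in `charges` also look incorrect.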
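
One way to make such a check repeatable is to compare every row against plausibility bounds. The sketch below uses plain Python on toy rows; the bounds and the helper name `implausible_fields` are assumptions chosen for illustration, and only the three extreme values are taken from the summary above.

```python
# Hypothetical plausibility bounds; these limits are assumptions for
# illustration, not values prescribed by the assignment.
BOUNDS = {
    "bmi": (10.0, 70.0),
    "children": (0, 15),
    "charges": (0.0, float("inf")),
}

# Toy rows: the extremes 3267.0, 73 and -7151.092 come from the summary,
# all other values are made up.
rows = [
    {"bmi": 27.9, "children": 0, "charges": 16884.924},
    {"bmi": 3267.0, "children": 1, "charges": 4719.74},
    {"bmi": 30.4, "children": 73, "charges": 9377.90},
    {"bmi": 26.3, "children": 2, "charges": -7151.092},
]


def implausible_fields(row):
    """Return the names of the fields whose value lies outside BOUNDS."""
    return [
        name
        for name, (low, high) in BOUNDS.items()
        if not low <= row[name] <= high
    ]


# Map row index -> offending fields, keeping only rows with a violation.
flagged = {}
for index, row in enumerate(rows):
    fields = implausible_fields(row)
    if fields:
        flagged[index] = fields

print(flagged)  # {1: ['bmi'], 2: ['children'], 3: ['charges']}
```

Where exactly the bounds should sit is a judgment call; the point is only that each suspicious value is reported together with the column it belongs to.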

#### J) Estimate the distribution of age with a visualization. Comment on this distribution.

In [54]:
df.get_column("age").plot.hist()

#### K) What is the distribution of gender? What is the distribution on smoker? What is the distribution on region? Comment these distributions.

In [None]:
df.get_column("gender").plot.hist()
df.get_column("smoker").plot.hist()
df.get_column("region").plot.hist()

#### L) Estimate the distribution of “bmi” with a visualization. Comment on this distribution

In [None]:
df.get_column("bmi").plot.hist()

#### M) What is the distribution of the charges?

In [None]:
df.get_column("charges").plot.hist()

#### N) Create a correlation matrix (with the numeric variables including gender). Comment all correlation values. E.g.: on average, do women or men have higher charges? Which correlations look strange?

In [55]:
df.select(pl.selectors.numeric()).corr()

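| age       | gender    | bmi       | children  | charges   |
|-----------|-----------|-----------|-----------|-----------|
| f64       | f64       | f64       | f64       | f64       |
| 1.0       | -0.021437 | -0.016139 | -0.011135 | 0.298274  |
| -0.021437 | 1.0       | 0.030247  | -0.014571 | 0.057151  |
| -0.016139 | 0.030247  | 1.0       | -0.013117 | -0.010484 |
| -0.011135 | -0.014571 | -0.013117 | 1.0       | 0.05496   |
| 0.298274  | 0.057151  | -0.010484 | 0.05496   | 1.0       |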


#### O) Create a scatterplot with „charges“ vs age. Comment this plot.


In [56]:
df.plot.scatter(x="age", y="charges")

#### P) Create a „side-by-side-boxplot“ of your choice. Comment this plot.

In [61]:
boxplot = (
    alt.Chart(df)
    .mark_boxplot()
    .encode(
        x=alt.X("age", axis=alt.Axis(title="age")),
        y=alt.Y("charges", axis=alt.Axis(title="charges")),
        color=alt.Color("gender", legend=alt.Legend(title="gender")),
    )
)

boxplot.show()

#### Q) Create a new column „bmi_age“ – as the multiplication of bmi and age.

In [None]:
df = df.with_columns(
    (pl.col("bmi") * pl.col("age")).alias("bmi_age")
)  # The alias stores the product as a new column instead of overwriting bmi

df.head()


## Appendix A: Requirements

Paste the following into a file called `requirements.txt` to ensure you have all the right dependencies:

```text
altair==5.5.0
asttokens==2.4.1
attrs==24.2.0
category-encoders==2.6.4
comm==0.2.2
debugpy==1.8.9
decorator==5.1.1
executing==2.1.0
ipykernel==6.29.5
ipython==8.29.0
jedi==0.19.2
jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-client==8.6.3
jupyter-core==5.7.2
markupsafe==3.0.2
matplotlib-inline==0.1.7
mrmr-selection==0.2.8
narwhals==1.14.2
nest-asyncio==1.6.0
numpy==2.1.3
packaging==24.2
pandas==2.2.3
parso==0.8.4
patsy==1.0.1
pexpect==4.9.0
platformdirs==4.3.6
polars==1.15.0
prompt-toolkit==3.0.48
psutil==6.1.0
ptyprocess==0.7.0
pure-eval==0.2.3
pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.2
pyzmq==26.2.0
referencing==0.35.1
rpds-py==0.21.0
scikit-learn==1.5.2
scipy==1.14.1
setuptools==75.6.0
six==1.16.0
stack-data==0.6.3
statsmodels==0.14.4
threadpoolctl==3.5.0
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
typing-extensions==4.12.2
tzdata==2024.2
wcwidth==0.2.13
```
