# Preliminary Imports

In [8]:
import pandas as pd
import numpy as np
from random import randrange
from functools import reduce
from tail_recursive import tail_recursive

# Part 1: Think About The Data

## Part 1.1

Our dataset reports childcare prices from 2008 to 2018, broken down by childcare provider type, age of children, and county characteristics. It is of particular interest to us because it offers insight into how childcare costs have changed over the years and how they were affected by changes in governing bodies in the United States. Read this way, the dataset can be a useful tool for current and future parents, and for policymakers, when making decisions that affect childcare prices.

## Part 1.2

There are a total of 62 numerical attributes and 3 categorical attributes in the data set.
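
One way to check counts like these is pandas `select_dtypes`. The sketch below runs on a tiny made-up frame; the column names are hypothetical and not taken from the childcare data. The same two calls would apply to the real frame once it is loaded.

```python
import pandas as pd

# Hypothetical stand-in for the childcare data; column names are illustrative only.
toy = pd.DataFrame({
    "county_name": ["A", "B", "C"],
    "state_name": ["X", "X", "Y"],
    "unemployment_rate": [4.1, 5.3, 3.8],
    "median_income": [52000, 61000, 48000],
})

# Numeric columns (ints and floats) versus everything else.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

print(len(num_cols), "numerical attributes;", len(cat_cols), "categorical attributes")
```

Note that identifiers stored as integers would be counted as numerical by this check, so the split may need a manual adjustment.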

## Part 1.3

There are missing values in our data set, often for specific age groups or entire counties. 
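
A minimal sketch of how the missing values could be summarized, assuming a made-up frame with hypothetical column names (`infant_price`, `toddler_price`) in place of the real data. It reports the missing fraction per column and the rows where every price column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the childcare data; names and values are illustrative only.
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "infant_price": [100.0, np.nan, 120.0, np.nan],
    "toddler_price": [90.0, 95.0, np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_by_column = toy.isna().mean()

# Rows (counties) where every price column is missing.
price_cols = ["infant_price", "toddler_price"]
all_missing_rows = toy[toy[price_cols].isna().all(axis=1)]

print(missing_by_column)
print(all_missing_rows)
```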

(Complete tomorrow during the meeting (9/26))


## Part 1.4

We believe that the “unemployment rate of the population aged 20 to 64 years old”, “median household income expressed in 2018 dollars”, and “county identifier” will be the most descriptive attributes of the data. We expect the subgroup with relatively high median household income and low unemployment rate to be able to afford high-quality childcare, and therefore to purchase higher-priced childcare. Similarly, we expect subgroups with relatively low median household income and high unemployment rate to purchase lower-priced childcare. As for the county identifier, different states have different childcare policies, so we expect subgroups of states / counties with similar policies to emerge.

## Part 1.5

We believe that there will be clusters in the data because some of the attributes, like those mentioned in the previous section, strongly influence childcare prices. We therefore expect clusters of counties with similar values of those attributes (e.g., unemployment rate).

## Part 1.6

By finding clusters in the data, we can gain an understanding of the influential attributes that produce those clusters. This can greatly help parents and policymakers understand how their decisions might affect childcare prices. Furthermore, given a new instance, we can better predict which cluster it will fall into.

## Part 1.7

(NO IDEA)

## Part 1.8

This will depend on which clustering algorithm we implement. In general, however, we do not expect the clusters to be of similar size, because our dataset contains a large number of numerical attributes. Furthermore, taking the county identifier as an example, the number of states / counties that share similar childcare policies will not be evenly distributed, which leads to clusters of different sizes.

In [9]:
df = pd.read_csv("childcare_costs.csv")
# Fraction of missing values in each row, written out for inspection.
df.isna().mean(axis=1).to_csv("nancount.csv")

# Part 2: Perform Some Exploratory Data Analysis

## Part 2.1

In [None]:
import pandas as pd
import numpy as np
import openpyxl as xls


## Part 2.2

## Part 2.3

## Part 2.4

## Part 2.5

## Part 2.6

## Part 2.7

## Part 2.8

## Part 2.9

# Part 3: Write Functions For Clustering in Python

## Part 3.1: K-Means Clustering Algorithm

### Preliminaries

To test the implementation, I am using a small soybean dataset.

In [10]:
df = pd.read_csv("soybean_data.csv", index_col=0)

### Create Distance Matrix Function

In [11]:
def distMatrix(df, centroids):
    # Entry (i, j) is the Euclidean distance from row i of df to centroid j.
    df_mat, cent_mat = df.to_numpy(), centroids.to_numpy()
    return np.array([[np.linalg.norm(df_mat[i, :] - cent_mat[j, :])
                      for j in range(cent_mat.shape[0])]
                     for i in range(df_mat.shape[0])])
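
The nested comprehension above computes one norm per (point, centroid) pair. A possible alternative, sketched here under the assumption that all columns are numeric, uses NumPy broadcasting to compute the whole matrix at once. The name `dist_matrix_vectorized` and the toy frames are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def dist_matrix_vectorized(df, centroids):
    """Euclidean distance from every row of df to every row of centroids."""
    pts = df.to_numpy(dtype=float)            # shape (n, d)
    cents = centroids.to_numpy(dtype=float)   # shape (k, d)
    # Broadcasting (n, 1, d) - (1, k, d) gives all pairwise differences, shape (n, k, d).
    diff = pts[:, None, :] - cents[None, :, :]
    # Norm over the feature axis leaves an (n, k) distance matrix.
    return np.linalg.norm(diff, axis=2)

# Toy check: the point (3, 4) lies at distance 5 from the origin.
points = pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0]})
cents = pd.DataFrame({"x": [0.0], "y": [0.0]})
dm = dist_matrix_vectorized(points, cents)
```

The trade-off is memory: the intermediate array holds n × k × d values, which is fine for a small test set but may matter for a larger one.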

### Assign New Clusters

In [12]:
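def cluster(dm):
    # Returns a function mapping row position i to the index of its nearest centroid.
    def f(i):
        # Pair each distance with its centroid index and keep the smallest (first wins on ties).
        return reduce(lambda x, y: x if x[0] <= y[0] else y, zip(dm[i, :], range(len(dm[i, :]))))[1]
    return f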
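
The `reduce` over (distance, index) pairs picks the index of the smallest distance in a row, keeping the earlier centroid on ties. A possible equivalent, shown here as a sketch on a made-up distance matrix, is `np.argmin` along each row, which also returns the first minimum.

```python
import numpy as np

# Made-up distance matrix: 2 points (rows) by 3 centroids (columns).
dm = np.array([
    [0.5, 2.0, 1.0],
    [3.0, 1.0, 1.0],  # tie between centroid 1 and centroid 2
])

# Index of the nearest centroid for every row at once.
labels = np.argmin(dm, axis=1)
print(labels)
```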

### Bringing It All Together

In [13]:
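def kmeans(df, k, eps):
    @tail_recursive
    def go(currentClus, prevClus, r):
        print("Iteration Number {}".format(r))
        # Centroids are the per-cluster column means under the current assignment.
        centroids = pd.DataFrame(df.to_dict(orient='series') | {"Cluster": currentClus}).groupby(by=["Cluster"]).mean()
        # Stop once the fraction of points that changed cluster falls below eps.
        if prevClus is not None and ((currentClus != prevClus).sum() / len(currentClus)) < eps:
            return centroids
        else:
            # Otherwise reassign every point to its nearest centroid and repeat.
            dm = distMatrix(df, centroids)
            return go.tail_call(df.index.map(cluster(dm)), currentClus, r + 1)
    # Start from a uniformly random assignment of rows to k clusters.
    return go([randrange(k) for _ in range(df.shape[0])], None, 0)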
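
For comparison, the same idea can be written as a plain loop without the `tail_recursive` dependency. This is a sketch under stated assumptions, not a drop-in replacement: the name `kmeans_iterative`, the `max_iter` cap, and the `seed` parameter are additions for illustration, and clusters that end up empty are simply dropped.

```python
import numpy as np
import pandas as pd

def kmeans_iterative(df, k, eps, max_iter=100, seed=0):
    """Loop-based k-means sketch; max_iter and seed are assumed extras."""
    rng = np.random.default_rng(seed)
    pts = df.to_numpy(dtype=float)
    # Start from a random assignment of rows to k clusters.
    labels = rng.integers(0, k, size=len(pts))
    for _ in range(max_iter):
        # Keep only cluster ids that currently have members.
        present = np.unique(labels)
        cents = np.array([pts[labels == c].mean(axis=0) for c in present])
        # Distance from every point to every centroid, then nearest centroid.
        dists = np.linalg.norm(pts[:, None, :] - cents[None, :, :], axis=2)
        new_labels = present[np.argmin(dists, axis=1)]
        # Fraction of points whose cluster changed in this pass.
        changed = np.mean(new_labels != labels)
        labels = new_labels
        if changed < eps:
            break
    # Final centroids: per-cluster column means under the last assignment.
    centroids = pd.DataFrame(pts, columns=df.columns).groupby(labels).mean()
    return centroids, labels

# Toy data: two well-separated groups of points (illustrative values only).
toy = pd.DataFrame({"x": [0.0, 0.1, 10.0, 10.1], "y": [0.0, 0.1, 10.0, 10.1]})
centroids, labels = kmeans_iterative(toy, k=2, eps=0.001)
```

Returning the labels alongside the centroids makes it easier to inspect cluster sizes afterwards.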

In [14]:
kmeans(df, 4, 0.001)

Iteration Number 0
Iteration Number 1
Iteration Number 2
Iteration Number 3
Iteration Number 4


| Cluster | Date | Plant-Stand | Precip | Temp | Hail | Crop-Hist | Area-Damaged | Severity | Seed-TMT | Germination | ... | Int-Discolor | Sclerotia | Fruit-Pods | Fruit Spots | Seed | Mold-Growth | Seed-Discolor | Seed-Size | Shriveling | Roots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.4 | 1.0 | 1.866667 | 0.533333 | 0.266667 | 1.333333 | 1.133333 | 1.4 | 0.533333 | 0.8 | ... | 0.0 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.733333 |
| 1 | 4.7 | 0.0 | 0.0 | 1.6 | 0.6 | 1.6 | 2.5 | 1.0 | 0.5 | 0.9 | ... | 2.0 | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4.5 | 0.0 | 2.0 | 1.0 | 0.1 | 1.9 | 0.3 | 1.3 | 0.5 | 1.3 | ... | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2.416667 | 0.833333 | 1.833333 | 0.166667 | 0.333333 | 2.166667 | 1.0 | 1.833333 | 0.416667 | 1.583333 | ... | 0.0 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.583333 |


## Part 3.2

# Part 4: Analyze Your Data

## Part 4.1

## Part 4.2

## Part 4.3

## Part 4.4