Please use this structure for your report, but you do not have to
slavishly follow this template. All bullet points are merely suggestions
and potential points to discuss in your writeup. Your report should be
no more than 12 pages, including figures. Do not include *any* code or
code output in your report. Indicate your informal collaborators on the
assignment, if you had any.

# Introduction

Things to potentially include in your introduction:

-   Describe the problem of interest and put your analysis in the domain
    context. Read the introductions of the two Nerbonne and Kretzschmar
    papers for some help here.

-   What do you aim to learn from this data?

-   Outline what you will be doing in the rest of the report/analysis.

# The Data {#data}

-   What is the data that you will be looking at?

-   Provide a brief overview of the data

-   How is this data relevant to the problem of interest? In other
    words, make the link between the data and the domain problem

In [None]:
pip install --upgrade numpy scipy seaborn matplotlib



In [None]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from pyreadr import read_r
import numpy as np

# Load the data
ling_data = pd.read_csv('../data/lingData.txt', sep=r'\s+')
ling_location = pd.read_csv('../data/lingLocation.txt', sep=r'\s+')

# ling_data has one column per question; ling_location has one column
# per question x answer combination. The ling_location columns are not usefully
# named, but it is not too tricky to figure out which is which.
# Note that you still need to clean this data (check for NAs, missing location data, etc.).

# Load the question_data which contains quest.mat, quest.use, ans.---
question_data = read_r('../data/question_data.RData')

In [None]:
# Inspect column names
print(ling_data.columns)
print(ling_location.columns)

# Load state geometries
state_df = gpd.read_file('../data/shapefiles')
state_df = state_df[state_df['iso_a2'] == 'US']  # Filter to only US

Index(['ID', 'CITY', 'STATE', 'ZIP', 'Q050', 'Q051', 'Q052', 'Q053', 'Q054',
       'Q055', 'Q056', 'Q057', 'Q058', 'Q059', 'Q060', 'Q061', 'Q062', 'Q063',
       'Q064', 'Q065', 'Q066', 'Q067', 'Q068', 'Q069', 'Q070', 'Q071', 'Q072',
       'Q073', 'Q074', 'Q075', 'Q076', 'Q077', 'Q078', 'Q079', 'Q080', 'Q081',
       'Q082', 'Q083', 'Q084', 'Q085', 'Q086', 'Q087', 'Q088', 'Q089', 'Q090',
       'Q091', 'Q092', 'Q093', 'Q094', 'Q095', 'Q096', 'Q097', 'Q098', 'Q099',
       'Q100', 'Q101', 'Q102', 'Q103', 'Q104', 'Q105', 'Q106', 'Q107', 'Q109',
       'Q110', 'Q111', 'Q115', 'Q117', 'Q118', 'Q119', 'Q120', 'Q121', 'lat',
       'long'],
      dtype='object')
Index(['Number of people in cell', 'Latitude', 'Longitude', 'V4', 'V5', 'V6',
       'V7', 'V8', 'V9', 'V10',
       ...
       'V462', 'V463', 'V464', 'V465', 'V466', 'V467', 'V468', 'V469', 'V470',
       'V471'],
      dtype='object', length=471)


## Data Cleaning

-   This dataset isn't as bad as the TBI data, but there are still some
    issues. You should discuss them here and describe your strategies
    for dealing with them.

-   Remember to record your preprocessing steps and to be transparent!
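
As a starting point, a minimal sketch of two such checks might look like the following. It runs on a small made-up frame that borrows a few column names from `ling_data`; the toy values and the helper names `summarize_missing` and `drop_missing_location` are hypothetical, and the real data may need different or additional checks.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ling_data (made-up values; the real file has many more columns).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "STATE": ["CA", "NY", None, "TX"],
    "Q050": [1, 4, 2, 3],
    "lat": [37.8, np.nan, 40.7, 29.7],
    "long": [-122.3, np.nan, -74.0, -95.4],
})


def summarize_missing(df):
    """Return the number of missing values in each column (hypothetical helper)."""
    return df.isna().sum()


def drop_missing_location(df):
    """Drop rows lacking latitude or longitude and report how many were removed."""
    cleaned = df.dropna(subset=["lat", "long"])
    n_dropped = len(df) - len(cleaned)
    return cleaned, n_dropped


missing_counts = summarize_missing(toy)
cleaned, n_dropped = drop_missing_location(toy)
print(missing_counts)
print(f"Dropped {n_dropped} of {len(toy)} rows with missing location")
```

Returning the number of dropped rows alongside the cleaned frame makes it easy to record each preprocessing step as you go.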

## Exploratory Data Analysis {#data-exploration}

-   This is where you compare pairs of questions with discussion and
    plots.
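
One possible way to compare a pair of questions is a row-normalized contingency table. The sketch below uses made-up answer codes in two columns named after `Q050` and `Q051`; it illustrates the idea and is not a prescribed analysis.

```python
import pandas as pd

# Made-up answer codes for two questions (not real survey responses).
toy = pd.DataFrame({
    "Q050": [1, 1, 2, 2, 1, 3],
    "Q051": [1, 2, 2, 2, 1, 1],
})

# Row-normalized contingency table: each row gives the distribution of
# Q051 answers among respondents who gave that Q050 answer.
table = pd.crosstab(toy["Q050"], toy["Q051"], normalize="index")
print(table.round(2))
```

A table like this can be passed to `seaborn.heatmap` for a plot, and rows that concentrate on a single column suggest that the two questions carry overlapping information.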

# Dimension Reduction

-   This is where you discuss and show plots about the results of
    whatever dimension reduction techniques you tried---PCA, variants of
    PCA, t-SNE, NMF, random projections, etc.

-   What do you learn from your dimension reduction outputs?

-   Discuss centering and scaling decisions.
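
To see why centering and scaling decisions matter, the following sketch implements PCA through a singular value decomposition and runs it with and without scaling on synthetic data. The function `pca` and its parameters are hypothetical names chosen for illustration, and the random matrix merely stands in for an encoded response matrix.

```python
import numpy as np


def pca(X, n_components=2, center=True, scale=False):
    """PCA via SVD. Returns (scores, explained_variance_ratio).

    The ratio is a share of variance only when center=True; without
    centering it is a share of the total sum of squares.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    if scale:
        std = X.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # leave constant columns unscaled
        X = X / std
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    ratio = (s ** 2) / np.sum(s ** 2)
    return scores, ratio[:n_components]


# Synthetic data: five independent columns, the first on a much larger scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10.0

scores_raw, ratio_raw = pca(X, scale=False)
scores_std, ratio_std = pca(X, scale=True)
print("unscaled:", ratio_raw.round(3))
print("scaled:  ", ratio_std.round(3))
```

Without scaling, the first component is dominated by the large-scale column; after scaling, no single column dominates. Whether that is desirable depends on how the features were encoded.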

# Clustering

-   This is where you discuss and show plots about the results of
    whatever clustering methods you tried---k-means, hierarchical
    clustering, NMF, etc.
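
For reference, a bare-bones version of k-means (Lloyd's algorithm) can be sketched in NumPy as below. This is an illustrative implementation on synthetic blobs, with a hypothetical function name and defaults; a library implementation is usually the better choice for the actual analysis.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def assign(centers):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    # Initialize the centers at k distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute labels so they match the returned centers.
    return assign(centers), centers


# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels, centers = kmeans(X, k=2)
print(np.bincount(labels))
```

The `seed` argument controls the initial centers, so varying it is one simple way to check how much the result depends on the starting point.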

# Stability of findings to perturbation

-   What happens to your clusters when you perturb the data set?

-   What happens when you re-run the algorithm with different starting
    points?
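
One way to quantify these comparisons is to measure the agreement between two labelings with the Rand index, which ignores how the clusters happen to be numbered. The sketch below is a minimal version under that assumption; the helper names `rand_index` and `subsample_indices` are hypothetical, and the label vectors are made up rather than produced by a real clustering run.

```python
import numpy as np


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two points
    together, or both put them apart. Uses O(n^2) memory.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)  # each unordered pair exactly once
    return float(np.mean(same_a[iu] == same_b[iu]))


def subsample_indices(n, frac, rng):
    """Sorted row indices of a random subsample, drawn without replacement."""
    size = int(round(frac * n))
    return np.sort(rng.choice(n, size=size, replace=False))


run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same partition, cluster numbers permuted
run_3 = [0, 0, 0, 1, 1, 1]  # a genuinely different partition

ri_same = rand_index(run_1, run_2)
ri_diff = rand_index(run_1, run_3)
print(ri_same, ri_diff)

idx = subsample_indices(100, 0.8, np.random.default_rng(0))
print(len(idx))
```

To compare a clustering of a subsample with a clustering of the full data, restrict the full-data labels to the subsampled indices before calling `rand_index`.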

# Conclusion

-   Discuss the three realms of data science by answering the questions
    in the instructions pdf.

-   Come up with a reality check that would help you to verify your
    clustering. You do not have to perform this reality check, but you
    may if it is feasible.

-   What are the main takeaways from your
    exploration/clustering/stability analysis?

# Academic Honesty {#academic-integrity-statement}

## Statement

Please include your academic honesty statement here. Do NOT include your
name.

## LLM Usage

If, in accordance with the policy in `lab2-instructions.pdf`, you used
the one exception to the LLM ban to help complete the lab, please see
the instructions for what to write here.

## Collaborators

List your collaborators here.


# Bibliography

Include any references you used in your report here.