# 1-D Exploratory Data Analysis

In this notebook, do some EDA in one dimension. Pick a column (or a set of columns) you're interested in looking at. Calculate some summary statistics (like mean, median, min, max, and sd). Then make some plots to visualize the distribution of the data. Distribution plots include histograms, boxplots, dotplots, beeswarms, and violin plots. Review [ggplot-intro](https://github.com/data4news/ggplot-intro) for examples of these kinds of distribution plots.

### Standard Python and R imports

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

1: Setting LC_COLLATE failed, using "C" 
2: Setting LC_TIME failed, using "C" 
3: Setting LC_MESSAGES failed, using "C" 
4: Setting LC_MONETARY failed, using "C" 


In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [3]:
%%R

# My commonly used R imports

require('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: tidyverse


## Load the data

In [4]:
%%R
 
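# Import data with R
# NOTE: the preview below comes back as a single column with ';' between fields,
# which suggests the file is semicolon-delimited; if so, try
# read_delim('YES.csv', delim = ';', show_col_types = FALSE) instead.
df <- read_csv('YES.csv', show_col_types = FALSE)
df %>% head(4)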

# A tibble: 4 × 1
  GEOIDCORRECT;NAME_x;variable;estimate;Name;Location;Open Year-Round;Handicap…¹
  <chr>                                                                         
1 36061006900;Census Tract 69, New York County, New York;population;2450;James …
2 36061006900;Census Tract 69, New York County, New York;poverty;2126;James J. …
3 36061006900;Census Tract 69, New York County, New York;med_inc;237500;James J…
4 36061018300;Census Tract 183, New York County, New York;population;8578;River…
# ℹ abbreviated name:
#   ¹​`GEOIDCORRECT;NAME_x;variable;estimate;Name;Location;Open Year-Round;Handicap Accessible;Borough;Comments;Latitude;Longitude;SUFFIX;POP100;GEOID;CENTLAT;BLOCK;AREAWATER;STATE;BASENAME;OID;LSADC;INTPTLAT;FUNCSTAT;NAME_y;OBJECTID;TRACT;CENTLON;BLKGRP;AREALAND;HU100;INTPTLON;MTFCC;LWBLKTYP;UR;COUNTY;SUFFIX.1;POP100.1;GEOID.1;CENTLAT.1;BLOCK.1;AREAWATER.1;STATE.1;BASENAME.1;OID.1;LSADC.1;INTPTLAT.1;FUNCSTAT.1;NAME.1;OBJECTID.1;TRACT.1;CENTLON.1;BLKGRP.1;AREALAND.1;HU1

One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat) 
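
The one-column tibble and the parsing warning above are what you would expect if a semicolon-delimited file is read with a comma delimiter. The sketch below reproduces that effect on the Python side, using only the standard library. The `sample` text is an abbreviated copy of the first four header fields and the first three rows shown in the output above; the `parse` helper is hypothetical and not part of the notebook.

```python
import csv
import io

# Abbreviated sample: the first four columns of the header and of the first
# three rows printed above (the real file has many more columns).
sample = (
    "GEOIDCORRECT;NAME_x;variable;estimate\n"
    "36061006900;Census Tract 69, New York County, New York;population;2450\n"
    "36061006900;Census Tract 69, New York County, New York;poverty;2126\n"
    "36061006900;Census Tract 69, New York County, New York;med_inc;237500\n"
)

def parse(text, delimiter):
    """Split `text` into a header row and data rows using `delimiter`."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[0], rows[1:]

# Wrong delimiter: the whole header line stays in one field.
comma_header, _ = parse(sample, ",")
# Right delimiter: the header and each row split into four fields.
semi_header, semi_rows = parse(sample, ";")

print(len(comma_header), comma_header)
print(len(semi_header), semi_header)
print(semi_rows[0])
```

Notice that the tract name itself contains commas, so a comma delimiter not only collapses the header into one column but also splits the data rows in the wrong places.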


## Summary statistics

Pick a column or set of columns and calculate some summary statistics (like mean, median, min, max, and sd).
Hint: you may want to use `group_by` and `summarize`.
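
To make the group-then-summarize idea concrete, here is a minimal standard-library Python sketch of the same computation. It assumes long-format data with a grouping column and a numeric column, like the `variable` and `estimate` columns in the file above; the numbers in `records` are made up for illustration and are not taken from the data.

```python
from collections import defaultdict
import statistics

# Hypothetical (variable, estimate) pairs; the values are invented.
records = [
    ("population", 2000), ("population", 4000), ("population", 9000),
    ("poverty", 100), ("poverty", 300), ("poverty", 500),
]

# Step 1: group the estimates by variable (the group_by step).
groups = defaultdict(list)
for variable, estimate in records:
    groups[variable].append(estimate)

# Step 2: reduce each group to one row of statistics (the summarize step).
# statistics.stdev is the sample standard deviation and needs at least
# two values per group.
summary = {
    variable: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "sd": statistics.stdev(values),
    }
    for variable, values in groups.items()
}

for variable, stats in summary.items():
    print(variable, stats)
```

In the made-up `population` group the mean (5000) sits above the median (4000) because one large value pulls it up. Comparing the two is a quick first check for skew before you plot anything.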



In [None]:
%%R 

# code for summary statistics



In [None]:
%%R

discrete_variables <- c('vs', 'am', 'gear', 'carb')
# 👉 Select only the discrete variables and make a pivot table for each,
# so we know how many cars there are in each category (for example, how many are automatic vs. manual).

mtcars %>% 
    # all_of() makes it explicit that discrete_variables is a vector of column names
    select(all_of(discrete_variables)) %>%
    pivot_longer(all_of(discrete_variables), names_to = "variable", values_to = "value") %>% 
    group_by(variable, value) %>% 
    summarize(
        count = n(),
        .groups = "drop" # return an ungrouped table and silence the grouping message
    )

## 1-D visualizations (aka distributions)


### Continuous variables

For each continuous variable you are interested in, use ggplot to plot its distribution. You can use histograms, dot plots, box plots, beeswarms, etc. (whichever chart type you find most useful). Learn about that variable and give each chart a headline that explains what you're seeing. The chart can also show the mean or median of the variable for reference (for example, on a histogram you can add a vertical line through the median).
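
Before plotting, it can help to see what a histogram with a median reference actually computes. The sketch below is a plain-Python illustration, not the ggplot code the cells below ask for. The `values` list is made up, and `bin_counts` is a hypothetical helper that assumes equal-width bins and at least two distinct values.

```python
import statistics

# Hypothetical values for one continuous variable (invented for illustration).
values = [12, 15, 17, 18, 21, 22, 22, 25, 31, 48]

def bin_counts(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for x in data:
        # The maximum value goes into the last bin rather than a new one.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts

edges, counts = bin_counts(values, n_bins=4)
median = statistics.median(values)

# Print a text histogram and mark the bin that contains the median.
for left, right, count in zip(edges, edges[1:], counts):
    marker = "  <- median" if left <= median < right else ""
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}{marker}")
```

The bar lengths are the counts a histogram would draw, and the marked bin is where a vertical median line would fall. Changing `n_bins` shows how much the apparent shape of a distribution depends on the bin width you choose.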

In [None]:
# code for plot 1
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 2
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 3
# make sure to make a meaningful title and subtitle

### Discrete variables

If there are any discrete variables you'd like to analyze, you can do that with charts here.

In [None]:
# code for plot 1
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 2
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 3
# make sure to make a meaningful title and subtitle