+ Questions about the lab
+ Missing values
+ Outliers
+ Data inconsistency

# Missing values
- Examples
  - Hourly rows for daily/monthly columns in the weather data
  - Missing date of death for people who are still alive
  - Missing responses on surveys

***Other examples of missing values?***

***Ideas for handling missing values***


Handling missing values

  - Sometimes the answer is to replace it with 0, but not often
    - Depends on the type of missing value
    - Throwing in some other special value (e.g., -1, 99999) is rarely any better
  - Dropping missing values is sometimes the answer, but not always 
    - It was fine to drop the Daily *Precipitation* values from the hourly rows in the weather CSV
  - (Often) better solution: imputation (filling in missing data)
    - Mean value: replace all missing values with the mean of the non-missing data
      - Can be safe / low-impact imputation strategy
      - Can introduce problems; e.g., would not work for imputing date of death
    - Random value: pick a random non-missing value
      - Advantage: drawing from the existing (empirical) distribution of values
      - Strategy: randomly impute missing values several times, and see the extent to which it changes your analysis -- if too much, that's a red flag.
    - Nearest neighbor: pick the "closest row" based on the non-missing values, and use its value for the missing one.
    - Interpolate from non-missing values (e.g., via linear regression). That is, "learn" a function that takes the non-missing values and predicts the missing value.
      - Can work well; can also create outliers
    - Heuristic values. E.g., missing year of death? Add 80 years to the date of birth.
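
A minimal sketch of three of the imputation strategies above, using pandas on a made-up series (the `temps` values are hypothetical, not from the weather data). Note that `Series.interpolate` draws a line between neighboring observations, which is a simpler stand-in for the regression-based interpolation described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six readings, two of them missing.
temps = pd.Series([10.0, 12.0, np.nan, 16.0, np.nan, 20.0])
missing = temps.isna()

# Mean imputation: every gap gets the mean of the observed values.
mean_filled = temps.fillna(temps.mean())

# Random imputation: each gap gets a random draw from the observed values.
rng = np.random.default_rng(0)
observed = temps.dropna().to_numpy()
random_filled = temps.copy()
random_filled[missing] = rng.choice(observed, size=missing.sum())

# Linear interpolation: each gap is filled from its neighbors.
interp_filled = temps.interpolate(method="linear")

print(mean_filled.tolist())
print(random_filled.tolist())
print(interp_filled.tolist())
```

Re-running the random imputation with different seeds and repeating the downstream analysis is one way to check how sensitive the results are to the imputed values.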

# Outliers

- An outlier is a datapoint that is significantly separated from the main body of observations/data
- Several causes:
  - They can be actual, valid observations/measurements.
    - The heavier the tail of the distribution that the data comes from, the more likely these are to appear. A "heavy-tailed" distribution puts more probability on values far from the mean.
  - Data entry errors; e.g., punching in the wrong numbers
  - Fraud; e.g., tampering with the data
  - Instrument error; e.g., malfunctioning sensor
  - Imputation gone awry
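
To illustrate the heavy-tail point, here is a small simulation sketch. It assumes a Student's t distribution with 5 degrees of freedom as the "heavy-tailed" example (my choice for illustration) and compares how often samples land more than 3 standard deviations from the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def frac_beyond(x, k=3):
    """Fraction of samples more than k sample standard deviations from the sample mean."""
    z = (x - x.mean()) / x.std()
    return np.mean(np.abs(z) > k)

# Light-tailed reference: standard normal.
normal_frac = frac_beyond(rng.standard_normal(n))
# Heavy-tailed example (assumed): Student's t with 5 degrees of freedom.
heavy_frac = frac_beyond(rng.standard_t(df=5, size=n))

print(f"normal: {normal_frac:.4f}, heavy-tailed: {heavy_frac:.4f}")
```

Both samples are valid draws from their distributions; the heavy-tailed one simply produces far-out values more often.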
  

### Handling Outliers
- How to detect?
  - Visual inspection (e.g., make a histogram)
  - Look at the min/max values, verify them
  - Flag values more than $k$ standard deviations from the mean (e.g., $k=1,2,3$).
- How to handle them?
  - Use methods that are robust to outliers (e.g., median over mean)
  - Exclude/drop them (not preferred unless they were due to errors)
    - In some cases, outliers may be the most important (e.g., earthquakes and building standards, flood mitigation)
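
A rough sketch of the $k$-standard-deviation flag and of why the median is more robust than the mean. The data here is synthetic (200 values near 10 plus one injected value of 55), chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data: 200 well-behaved values plus one injected outlier.
values = pd.Series(np.append(rng.normal(10.0, 0.2, size=200), 55.0))

# Flag values more than k standard deviations from the mean.
k = 3
z = (values - values.mean()) / values.std()
flagged = values[np.abs(z) > k]
print(flagged.tolist())

# The single outlier pulls the mean upward; the median barely moves.
print("mean:", values.mean(), "median:", values.median())
```

One caveat: the outlier itself inflates the mean and standard deviation used in the z-score, so in small samples this rule can fail to flag even an extreme value.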

# Consistency Issues

Can arise even within a single dataset, but are even more likely when combining information from multiple datasets. Can corrupt/invalidate your results.

- Units
  - Know what your units are!
    - Common issues: English vs metric units
  - Stick to one set of units if at all possible
  - To infer or detect issues, plot histograms or otherwise visualize the data
- Numerical representations
  - E.g., 16.5 vs 16 1/2 vs 16 vs sixteen
  - Resist temptation to "round" floats into integers (floats are your friend)
- Currency/financial unification
  - Different currencies, may need to apply exchange rates
  - Account for inflation
  - Stock splits

 Note:
 - Good practice to create new columns to store the transformed data, so you have a paper-trail of how the consistency issues were addressed.
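
A small sketch of the "new column" practice, assuming a hypothetical table where heights were recorded in a mix of inches and centimeters (the column names `height`, `height_unit`, and `height_cm` are made up for this example).

```python
import pandas as pd

# Hypothetical records with mixed units.
df = pd.DataFrame({
    "height": [70.0, 175.0, 64.0, 180.0],
    "height_unit": ["in", "cm", "in", "cm"],
})

CM_PER_INCH = 2.54
# Conversion factor for each row, based on its recorded unit.
to_cm = df["height_unit"].map({"in": CM_PER_INCH, "cm": 1.0})

# Store the unified values in a NEW column; the originals stay untouched.
df["height_cm"] = df["height"] * to_cm
print(df)
```

Keeping `height` and `height_unit` alongside `height_cm` means the conversion can be audited or redone later.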

***Question: what are some other consistency issues we might encounter? Brainstorm and compare notes with those around you.***


## Text data

### [Kaggle fake news dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?resource=download)

### [Spacy tutorial 1](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-spacy)

### [Spacy tutorial 2](https://towardsdatascience.com/analysis-and-visualization-of-unstructured-text-data-2de07d9adc84)



In [None]:
import pandas as pd


In [None]:
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

In [None]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
real['date'].head()

0    December 31, 2017 
1    December 29, 2017 
2    December 31, 2017 
3    December 30, 2017 
4    December 29, 2017 
Name: date, dtype: object

In [None]:
set([type(date) for date in fake['date']])
f1 = '%B %d, %Y'
f2 = '%d-%b-%y'

In [None]:
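# errors='ignore' returns the input unchanged if any value fails to parse.
# Note: errors='ignore' and infer_datetime_format are deprecated in pandas 2.x.
datetimes = pd.to_datetime(fake['date'], errors='ignore', infer_datetime_format=True)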

In [None]:
set([type(date) for date in datetimes])


{str}

In [None]:
from datetime import datetime

# Date formats seen in the 'date' column
formats = {0: '%B %d, %Y',  # full month name, day, 4-digit year
           1: '%d-%b-%y',   # day, abbreviated month, 2-digit year
           2: '%b %d, %Y'}  # abbreviated month, day, 4-digit year
dates = []
bad = []  # rows that fail to parse: [raw value, format tried, row index]
for i, date in enumerate(fake['date']):
  f = None
  try:
    if date[0].isdigit():
      f = formats[1]
    elif len(date.split()[0].strip()) > 3:
      f = formats[0]
    else:
      f = formats[2]
    dates.append(datetime.strptime(date.strip(), f))
  except Exception:
    # Record the format string that was tried (f), not the builtin `format`
    bad.append([date, f, i])

In [None]:
bad

[['https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/',
  '%B %d, %Y',
  9358],
 ['https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/',
  '%B %d, %Y',
  15507],
 ['https://100percentfedup.com/12-yr-old-black-conservative-whose-video-to-obama-went-viral-do-you-really-love-america-receives-death-threats-from-left/',
  '%B %d, %Y',
  15508],
 ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg',
  '%B %d, %Y',
  15839],
 ['https://fedup.wpengine.com/wp-content/uploads/2015/04/entitled.jpg',
  '%B %d, %Y',
  15840],
 ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg',
  '%B %d, %Y',
  17432],
 ['https://fedup.wpengine.c

In [None]:
fake = fake.drop([b[-1] for b in bad])

In [None]:
fake['Date'] = dates
fake['month'] = pd.DatetimeIndex(dates).month
fake['year'] = pd.DatetimeIndex(dates).year
fake['day'] = pd.DatetimeIndex(dates).day
fake.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23471 entries, 0 to 23480
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    23471 non-null  object        
 1   text     23471 non-null  object        
 2   subject  23471 non-null  object        
 3   date     23471 non-null  object        
 4   Date     23471 non-null  datetime64[ns]
 5   month    23471 non-null  int64         
 6   year     23471 non-null  int64         
 7   day      23471 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 1.6+ MB


In [None]:
fake[fake.year>2024]

Unnamed: 0,title,text,subject,date,Date,month,year,day
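
As an alternative to the row-by-row loop used for the dates, here is a sketch that tries each known format on the whole column with `errors='coerce'` (unparseable values become `NaT`) and keeps the first successful parse. The `raw` series is a hypothetical stand-in for `fake['date']`, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for fake['date']: three formats plus one non-date.
raw = pd.Series([
    "December 31, 2017 ",
    "19-Feb-18",
    "Dec 25, 2017",
    "https://example.com/not-a-date",
])
formats = ["%B %d, %Y", "%d-%b-%y", "%b %d, %Y"]

# Start with all-missing dates, then fill in whatever each format can parse.
parsed = pd.Series(pd.NaT, index=raw.index)
for fmt in formats:
    attempt = pd.to_datetime(raw.str.strip(), format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Rows that matched none of the formats remain NaT and can be inspected.
print(parsed)
print(raw[parsed.isna()])
```

Rows left as `NaT` play the same role as the `bad` list: they can be inspected and then dropped or repaired.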


### Numerical Normalization

- Numerical data values are often wildly different in magnitude from column to column; depends on units and what is being measured
- It can help to *normalize* them into well-known, well-behaved ranges
  - Especially helpful for some predictive models

#### Types of normalization
- Standardization: rescales a column to zero mean and unit variance ($\sigma^2=1$, which also means $\sigma=1$), using the column's mean $\mu$ and standard deviation $\sigma$.
$$ \hat{x}_i = \frac{x_i -\mu}{\sigma}$$
- These "z-scores" have some nice interpretability:
  - $\hat{x}_i < 0$, smaller than average
  - $\hat{x}_i > 0$, greater than average
  - $\hat{x}_i > 1$, more than one standard deviation above average
  - etc.
  - Can give context to how normal or anomalous a datapoint is

In [None]:
import pandas as pd
import seaborn as sns

data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')
df = df[df["Age"] >= 21]

# Is an arm circumference of 40 (cm) big or little?
# Express it as a z-score: how many standard deviations from the mean?
arm_cir = 40

mean = df['Arm Cir'].mean()
std  = df['Arm Cir'].std()
print((arm_cir - mean) / std)
sns.histplot(df['Arm Cir'])


- 0-1 normalization
$$ \hat{x}_i = \frac{x_i - x_{min}}{x_{max}-x_{min}}$$
  - Here $x_{max}$ and $x_{min}$ are the max/min values observed in the dataset -- ***or*** a theoretical min or max.
  - Warning: if a new datapoint comes along and you use the same mapping, can get values that are $<0$ or $>1$.
- To make values strictly positive, can exponentiate:
$$ \hat{x}_i = e^{x_i}$$
  - As $x_i \to -\infty$, the normalized value approaches 0 (but never reaches it)
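
A short sketch of 0-1 normalization and exponentiation with NumPy. The `train` values are made up; the point is that reusing the same $x_{min}$ and $x_{max}$ on new data can produce values outside $[0, 1]$.

```python
import numpy as np

# Hypothetical observed data used to fix the mapping.
train = np.array([2.0, 4.0, 6.0, 10.0])
x_min, x_max = train.min(), train.max()

def min_max(x):
    """0-1 normalization using the min/max observed in `train`."""
    return (x - x_min) / (x_max - x_min)

scaled = min_max(train)
print(scaled)

# New datapoints outside the observed range map outside [0, 1].
new_scaled = min_max(np.array([1.0, 12.0]))
print(new_scaled)

# Exponentiation maps any real value to a strictly positive one.
positive = np.exp(np.array([-10.0, 0.0, 2.0]))
print(positive)
```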