# Text Data Cleaning - Introduction

## Overview of Topics Covered:

## Main Notebooks:
### 0 - Introduction

* Python Strings
* Built-in String Methods
* Type Conversions
* Functions and Order of Operations

### 1 - Ask A Manager - Salary Survey

* *Messy* Data from Google Forms surveys
* ...Like, *really* messy
* CSV vs. Excel Files
* Converting Strings to Integers
* String Methods in Action
* `.apply()` and `lambda`
* Data Preprocessing
* Pandas `.merge()`

## Bonus Notebooks:
### 2 - Doctor Who - Actor Timeline

* Extracting text data from a Wikipedia table
* Pandas `.read_html()`
* Regular Expressions
* Timeline visualization

### 3 - Goodreads - Book Ratings

* Reading in a badly-formatted dataset
* Working with `bytes` objects
* Cleaning *before* loading into pandas

### 4 - Behavioral Risk Factor Surveillance System (BRFSS) 2014

* Extracting text data from PDF files using `pdfminer.six`
* Cleaning up excess whitespace in strings
* Using a dictionary to replace values in a column

# Python Strings

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## What are Strings?

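Strings are sequences of characters, and they are Python's basic type for text. They can contain any character, including letters, digits, whitespace, and punctuation. Python recognizes input as a string when it is enclosed in matching quotation marks, either single (`'...'`) or double (`"..."`).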

Strings can be combined using the "+" operator. Say we have two string variables that denote the start date and end date of a process, and we want to print them out in a coherent sentence. We can do so like this:

In [None]:
start_date = '2024-05-01'
end_date = '2024-05-31'

# Notice the spaces around "to".
full_string = start_date + " to " + end_date

print(full_string)

### Coding Exercise: "+" in Different Contexts

In [None]:
string_var1 = 'book'
string_var2 = 'shelf'

int_var1 = 1
int_var2 = 2

print(string_var1 + string_var2)
print(int_var1 + int_var2)

################################################################################
################################################################################

# What happens if you add string_var1 and int_var1?
string_var1 + int_var1

################################################################################
################################################################################
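As a rough sketch of what the exercise above runs into: "+" means addition for two numbers and concatenation for two strings, but Python refuses to guess when one of each is given. The names `word` and `number` below are illustrative stand-ins for `string_var1` and `int_var1`.

```python
word = 'book'   # illustrative stand-in for string_var1
number = 1      # illustrative stand-in for int_var1

try:
    word + number
except TypeError as err:
    # Python raises a TypeError instead of guessing what "+" should mean.
    error_message = str(err)
    print('TypeError:', error_message)

# Converting the integer with str() first makes "+" mean concatenation.
combined = word + str(number)
print(combined)   # book1
```

Wrapping the failing line in `try`/`except` is only there so the cell keeps running; in practice, the fix is the explicit `str()` conversion.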

### Escape Characters

The backslash, "`\`", is called an "escape character" in Python (and in markdown cells). Placing a backslash before certain characters in a string gives the pair a special meaning that differs from the character by itself. In Python strings, "\\t" represents a tab, and "\\n" represents a newline character.

Also, for a "\\" to show up correctly in markdown cells, it needs another \\ in front of it. "\`" (grave accent, also called a backtick) is another special character in markdown, where a pair of them marks inline code; it has no special meaning inside Python strings in code cells.

Double-click in this cell to see how many backslashes and graves there actually are in the markdown text.

In [None]:
print('line 1\n\n\n\n\n\n\tline 2 (tab-indented)')
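A small sketch of how escape sequences are stored may help here. Each escape sequence counts as a single character, and a raw string (the `r` prefix) switches escaping off. The file path below is hypothetical and only serves as an example.

```python
newline = '\n'
print(len(newline))      # 1: the backslash and the n form one character

backslash = '\\'
print(backslash)         # prints a single backslash
print(len(backslash))    # 1

# In a raw string, backslashes are kept literally, so \n and \t are
# not turned into a newline or a tab. (Hypothetical path.)
raw_path = r'C:\new_folder\table.csv'
print(raw_path)
```

Raw strings are often convenient for Windows file paths and for regular expressions, where backslashes appear frequently.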

## String Methods

String objects in Python have built-in functions called "methods" that allow specific operations to be performed without the need to write additional code. There are over 40 string methods, each with its own specific task.

Like all methods, these are called by writing the string variable, followed by the dot operator, the name of the method, and a pair of parentheses (which hold any arguments), i.e.: `string_name.method()`.

https://www.w3schools.com/python/python_ref_string.asp
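One detail worth illustrating: strings are immutable, so a string method returns a new string and leaves the original untouched. The variable names below are illustrative only.

```python
greeting = 'hello world'     # illustrative string
shouted = greeting.upper()

print(shouted)     # HELLO WORLD
print(greeting)    # hello world (the original is unchanged)

# Because each method returns a new string, calls can be chained.
chained = greeting.replace('world', 'there').title()
print(chained)     # Hello There
```

To keep the result of a method, assign it to a variable (possibly the same one), e.g. `greeting = greeting.upper()`.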

In [None]:
string_var3 = "My parents and I moved back to California from Berlin in September, 1989. \nI had just turned three years old."

### Case manipulation

In [None]:
print(string_var3)

In [None]:
print(string_var3.upper())

In [None]:
print(string_var3.lower())

In [None]:
print(string_var3.capitalize())

In [None]:
print(string_var3.title())

In [None]:
print(string_var3.swapcase())

### Properties

In [None]:
print(string_var3.isalpha())

In [None]:
print(string_var3.count('i'))
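The two results above can be surprising, so here is a brief sketch on shorter, made-up strings. `.isalpha()` is only `True` when every character is a letter, and `.count()` is case-sensitive.

```python
print('Berlin'.isalpha())        # True: letters only
print('Berlin 1989'.isalpha())   # False: contains a space and digits
print('1989'.isdigit())          # True: digits only

sentence = 'I moved to Berlin in September.'   # illustrative sentence
lowercase_count = sentence.count('i')
any_case_count = sentence.lower().count('i')

print(lowercase_count)   # 2: the capital "I" is not counted
print(any_case_count)    # 3: lowercasing first counts both cases
```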

### Segmentation

In [None]:
print(string_var3.split())

In [None]:
print(string_var3.split(', '))

In [None]:
print(string_var3.splitlines())
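A subtle point that matters for messy data: `.split()` with no argument splits on any run of whitespace and drops empty pieces, while `.split(' ')` splits on every single space. The string `messy` below is a made-up example.

```python
messy = '  too   many\tspaces\n'   # illustrative string with extra whitespace

clean_pieces = messy.split()
print(clean_pieces)    # ['too', 'many', 'spaces']

# Splitting on a single space keeps empty strings between repeated spaces
# and leaves the tab and newline characters in place.
pieces = messy.split(' ')
print(pieces)
```

For cleaning up whitespace, the no-argument form is usually the more forgiving choice.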

### Replacing Values

In [None]:
print(string_var3.replace('1989', '2015'))

In [None]:
print(string_var3.replace(' ', ''))

In [None]:
print(string_var3.partition('1989'))
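Unlike `.replace()`, `.partition()` does not change any text: it returns a 3-tuple holding the part before the separator, the separator itself, and the part after it. The sketch below uses an illustrative date string, and also shows the optional third argument of `.replace()`, which limits the number of replacements.

```python
date_text = '2024-05-01'   # illustrative value

# partition() splits at the FIRST occurrence of the separator only.
before, separator, after = date_text.partition('-')
print(before)      # 2024
print(separator)   # -
print(after)       # 05-01

# If the separator is missing, the whole string comes back first.
print(date_text.partition('/'))   # ('2024-05-01', '', '')

# replace() accepts a count that limits how many matches are replaced.
first_only = date_text.replace('-', '/', 1)
print(first_only)   # 2024/05-01
```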

### Coding Exercise: String Methods

Type your own sentence in the following cell, then try out different string methods on it to see how they affect the string.

In [None]:
################################################################################
################################################################################

string_var4 = "___"
print(string_var4)

################################################################################
################################################################################

## Converting Types

`str()` lets you change an integer or a float into a string. Python also has `int()` and `float()` for converting to integer and floating point data types, respectively.

In [None]:
print(int('3'))
print(float('3'))
print(int(3.0))
print(int(3.8))
print(float(3))
print(str(3.8))
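Two behaviors in the cell above are easy to miss, sketched here under the same kind of inputs: `int()` truncates toward zero rather than rounding, and it cannot parse a string that contains a decimal point.

```python
print(int(3.8))      # 3: truncated, not rounded
print(int(-3.8))     # -3: truncation goes toward zero
print(round(3.8))    # 4: use round() when rounding is wanted

conversion_failed = False
try:
    int('3.8')
except ValueError as err:
    # int() only accepts strings that look like whole numbers.
    conversion_failed = True
    print('ValueError:', err)

# Going through float() first handles decimal strings.
print(int(float('3.8')))   # 3
```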

### Coding Exercise: Strings and Quotation Marks

Try replacing the double-quotes around the string with single-quotes.

In [None]:
################################################################################
################################################################################

string2 = "This doesn't work unless you use double-quotes to open and close the string."
print(string2)

################################################################################
################################################################################
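For reference after trying the exercise: the apostrophe in "doesn't" closes a single-quoted string early. A minimal sketch of the usual workarounds, using made-up sentences:

```python
double_quoted = "This doesn't break."
escaped = 'This doesn\'t break.'     # backslash escapes the apostrophe

print(double_quoted == escaped)      # True: both hold the same text

# Triple quotes allow both kinds of quotation mark inside one string.
triple = '''She said "it doesn't matter".'''
print(triple)
```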

## Functions

Here are a few functions that will let us manipulate strings.

https://docs.python.org/3.5/tutorial/controlflow.html#documentation-strings

In [None]:
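def separate_words(sample_string: str, delimiter=' ') -> list:
    """Split a string into a list of words on the given delimiter."""
    words = sample_string.split(delimiter)
    return words

def add_elipses(sample_string: str) -> str:
    """Return the string with '...' appended."""
    return sample_string + '...'

def join_words(sample_list: list, delimiter=' ') -> str:
    """Join a list of strings into one string, separated by the delimiter."""
    title = delimiter.join(sample_list)
    return title

input_string = 'Please speak more slowly'
# Split into words, add an ellipsis to each word, then join them back together.
join_words([add_elipses(word) for word in separate_words(input_string)])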
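The nested call above is evaluated from the inside out. As a sketch, the same steps can be written one at a time using the built-in string methods directly, where each line stands in for one of the functions `separate_words()`, `add_elipses()`, and `join_words()`. The intermediate variable names are illustrative.

```python
input_string = 'Please speak more slowly'

# Step 1 (innermost): split the sentence into words.
words = input_string.split(' ')
print(words)       # ['Please', 'speak', 'more', 'slowly']

# Step 2: add an ellipsis to every word.
with_dots = [word + '...' for word in words]
print(with_dots)   # ['Please...', 'speak...', 'more...', 'slowly...']

# Step 3 (outermost): join the words back into one string.
result = ' '.join(with_dots)
print(result)      # Please... speak... more... slowly...
```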

### Coding Exercise: Functions as an Assembly Line

Try using each function (`separate_words()`, `add_elipses()`, `join_words()`) by itself, using the same input string. Combine more than one function, and change the order in which the functions are applied.


In [None]:
################################################################################
################################################################################

join_words(input_string)

################################################################################
################################################################################
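A note on the cell above, sketched with the underlying `.join()` method: a string is itself a sequence of characters, so joining a string (rather than a list of words) places the delimiter between every character. The sample values are illustrative.

```python
# Joining a string treats each character as a separate item.
spaced_letters = ' '.join('Please')
print(spaced_letters)   # P l e a s e

# Joining a list of words gives the sentence back.
sentence = ' '.join(['Please', 'speak'])
print(sentence)         # Please speak
```

This is why the order in which the functions are applied matters: `join_words()` expects the list that `separate_words()` produces.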