# Data Wrangling and Collating

##### Data wrangling and collating are large parts of the data science process. The ability to correctly extract, transform and load data is known in the industry as ETL. Information can come in various forms, with a few of the most popular examples being:

> - JSON APIs
> - Text Scraping (like TweetGen)
> - CSV Files (which can be dumps from databases)
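
##### As a rough sketch of how two of these sources end up in the same place, the snippet below loads a JSON payload and a CSV text into pandas dataframes. The payload, the CSV text and the names `json_text`, `csv_text`, `df_from_json` and `df_from_csv` are made up for illustration; a real API response or file would take their place.

```python
import io
import json

import pandas as pd

# Hypothetical JSON payload, standing in for the body of an API response
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'
records = json.loads(json_text)          # list of dicts, one per record
df_from_json = pd.DataFrame(records)     # each dict becomes one row

# Hypothetical CSV text, standing in for a file on disk
csv_text = "name,age\nAda,36\nGrace,45\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_json)
print(df_from_csv)
```

##### Whatever the source, the goal is the same: one row per record and one column per field.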


##### Let's start out by looking at a few examples of each.

##### One of the most common file structures that you will see and work with in industry is the CSV file.

##### For our tutorial, we will take a look at the Titanic dataset.

## CSV Files, Rows and Columns
##### CSV files constitute a large chunk of how data is represented, due to their organizational structure and their usage in industry. CSV files are commonly used as dumps for databases when their information must be represented in a format that can be exported to other programs or kept as backups.
##### This file structure is important because it neatly represents a collection of EVENTS or occurrences in our data, and answers quantity or quality questions about each single event, such as:
##### Who?
##### What?
##### When?
##### Where?
##### How?
##### How much?
##### How Many?

##### Each ROW in our CSV file represents a single occurrence, and is displayed horizontally as follows:



In [14]:
import pandas as pd

df = pd.read_csv('train.csv')
df[1:2]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


##### The labels that sit above the rows describe the values that every occurrence contains.
##### These are called columns, and each one describes a single measured FEATURE or ATTRIBUTE of all of the rows in a given dataset.

In [16]:
df['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
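
##### To make the row/column distinction concrete, here is a minimal sketch on a tiny made-up dataframe (the `passengers` data is hypothetical, not taken from the Titanic file). It pulls out one row by position with `.iloc` and one column by name.

```python
import pandas as pd

# Hypothetical mini-dataset: each row is one passenger (one event)
passengers = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [29, 41, 35],
    "Survived": [1, 0, 1],
})

# One row: every feature of a single passenger
second_row = passengers.iloc[1]

# One column: a single feature across all passengers
age_column = passengers["Age"]

print(second_row["Name"])   # the Name feature of the second passenger
print(age_column.mean())    # a summary over the whole Age column
```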

## Header Rows

The first row of a CSV file will often contain the names of all of the columns. By default, pandas reads the first row (row 0 of the file) and turns it into the "header" of its dataframe representation. That being said, there will be times when you have to either change or set the header yourself.

This can be done with:
> - df.rename (which renames only the selected columns and leaves the rest untouched)
> - df.columns (assigning a list of strings to it replaces ALL of the column names, so the list must contain one name per column)

In [20]:
df_to_change = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

df_to_change

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [24]:
# Rename only the listed columns; index=str also converts the row labels to strings
df_to_change = df_to_change.rename(index=str, columns={"A": "First", "B": "Second"})
df_to_change

Unnamed: 0,First,Second
0,1,4
1,2,5
2,3,6


In [26]:
# Replace every column name at once; the list needs one entry per column
df_to_change.columns = ['A', 'B']
df_to_change

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
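
The header can also be set while the file is being read. The sketch below assumes a CSV with no header row; the text in `csv_without_header` is made up for illustration. It compares the default behaviour with passing `header=None` and `names=` to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text with no header row
csv_without_header = "1,4\n2,5\n3,6\n"

# Default behaviour: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(csv_without_header))

# header=None keeps every row as data; names= supplies the column labels
df_named = pd.read_csv(io.StringIO(csv_without_header), header=None, names=["A", "B"])

print(list(df_default.columns))        # the first data row ended up as column names
print(len(df_default), len(df_named))  # the default read has one fewer data row
```

Setting the names at read time avoids silently losing the first record to the header.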


Discussion:
> - What are some pertinent questions that you may have that could be turned into features?
> - What information from your own applications contains features that you could analyze?
> - Why would you want to change the column headers on a dataframe?

## Data Dictionaries and describing your columns and features

##### Another part of the data science process involves labeling your features and attributes in a manner that not only gets the point of each feature across, but also works as a way to verify that each feature's value is within the constraints of its definition.

##### Generally, the more descriptive your data dictionary is, the less trouble you will have handling null or NaN values. It also makes heavier statistical analysis on your data easier.

##### Here is an example data dictionary for the Titanic dataset. For ease of organization and comprehension, a table is one of the most useful forms of display.

<center><h2>Titanic Data Dictionary</h2></center>


| Variable | Definition                                        | Key                                            |
|----------|---------------------------------------------------|------------------------------------------------|
| Survived | Survival                                          | 0 = No, 1 = Yes                                |
| Pclass   | Ticket Class (proxy for socio-economic status)    | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                               | female, male                                   |
| Age      | Age (in years)                                    | 0.6 = 81                                       |
| SibSp    | # of siblings and/or spouses also aboard          |                                                |
| Parch    | # of parents / children also aboard               |                                                |
| Ticket   | Ticket number                                     |                                                |
| Fare     | Passenger fare (how much their ticket cost)       |                                                |
| Cabin    | Cabin number                                      |                                                |
| Embarked | Port of Embarkation (where they boarded the ship) | C = Cherbourg, Q = Queenstown, S = Southampton |
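
##### A data dictionary can also be turned into a simple check. The sketch below is one possible approach, not a standard pandas feature: the allowed values come from the Key column of the table above, while the `sample` rows and the `find_violations` helper are hypothetical and written only for this example.

```python
import pandas as pd

# Allowed values taken from the Key column of the data dictionary
allowed_values = {
    "Survived": {0, 1},
    "Pclass": {1, 2, 3},
    "Sex": {"female", "male"},
    "Embarked": {"C", "Q", "S"},
}

# Hypothetical sample rows; the last one breaks two rules on purpose
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 4],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "X"],
})


def find_violations(frame, rules):
    """Return {column: [row labels]} for values outside the data dictionary."""
    violations = {}
    for column, allowed in rules.items():
        # Missing values are skipped here; how to treat them is a separate decision
        bad = frame[column].notna() & ~frame[column].isin(allowed)
        if bad.any():
            violations[column] = frame.index[bad].tolist()
    return violations


violations = find_violations(sample, allowed_values)
print(violations)
```

##### Columns without a Key entry (such as Ticket or Cabin) are not checked here, which is one reason a more complete dictionary pays off.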



Discussion:
> - Why are data dictionaries important?
> - What are some possible ways that an unclear or incorrect data dictionary could affect your work in analysis?


## Data Types

Datatypes in pandas come in some of the same great flavors that you're used to working with in regular Python! This also means that many of the operations you already know for these types are available for use in your applications and analyses. You can see the datatype of each column of a dataframe (a single column is also known as a Series) by using: df.dtypes


Here are a few that are available:

> - floats, which will display as "float64"
> - integers, which will display as "int64"
> - datetime objects, which will display as "datetime64[ns]"
> - strings, which will display as "object"


Datatypes assist you in making sure that your data is stored as a single type of value in any given column or Series. This is important because the operations that you will perform to index, transform or analyze a column require that its value in every row has the exact same type.

In [28]:
df_datatypes = pd.DataFrame({'int': [23],'float': [3.14], 'datetime': [pd.Timestamp('20181109')],'string': ['this is a string']})

display(df_datatypes.dtypes)

int                  int64
float              float64
datetime    datetime64[ns]
string              object
dtype: object
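
When a column ends up holding more than one kind of value, pandas falls back to the generic "object" type. The sketch below uses a made-up `fares` Series to show one way of bringing such a column back to a single numeric type with `pd.to_numeric`; the data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column mixing a float with strings, so pandas stores it as "object"
fares = pd.Series([7.25, "71.28", "unknown"])
print(fares.dtype)

# errors="coerce" turns values that cannot be parsed as numbers into NaN
fares_numeric = pd.to_numeric(fares, errors="coerce")
print(fares_numeric.dtype)

# Count how many entries could not be converted
print(fares_numeric.isna().sum())
```

Coercing is a trade-off: the column becomes usable for arithmetic, but the unparseable entries turn into missing values that still need a decision.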