# Data Wrangling Process

![WeRateDogs](https://cdn.shopify.com/s/files/1/1352/9125/files/wrd_logo_website-200h_540x.png?v=1599580382)

# The Dataset

## WeRateDogs

### Delimited Data
WeRateDogs is a novelty Twitter account that rates images of dogs. I was provided a URL for an archive of tweet data stored as a CSV file and a second URL for a machine learning image-prediction dataset stored as a TSV file. I read both files from their URLs into pandas DataFrames.
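
A minimal sketch of this loading step is shown here. The helper name `load_delimited`, the `jpg_url` column and the sample rows are assumptions made for illustration; in-memory text stands in for the two linked files, since `pd.read_csv` accepts a URL, a path, or a file-like object in the same way.

```python
import io

import pandas as pd


def load_delimited(source, sep=","):
    """Read a delimited file (URL, path, or file-like object) into a DataFrame."""
    return pd.read_csv(source, sep=sep)


# Invented stand-ins for the linked CSV and TSV files (not real data).
csv_text = "tweet_id,text\n1001,This is Rex. 12/10\n1002,Meet Bella. 13/10\n"
tsv_text = "tweet_id\tjpg_url\timg_num\n1001\thttps://example.com/a.jpg\t1\n"

df_archive = load_delimited(io.StringIO(csv_text))
df_image = load_delimited(io.StringIO(tsv_text), sep="\t")

print(df_archive.shape, df_image.shape)
```

Passing `sep="\t"` is the only change needed to read the tab-separated file with the same function.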

### Tweepy and Twitter

To make sure the data was accurate, I needed to pull each tweet listed in the delimited data from Twitter itself. Using `Tweepy`, I called the Twitter API and wrote the JSON object for each tweet to a file. Each line of tweet info was then read into a nested list, which was loaded into a DataFrame.
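
The reading half of this step might look like the sketch below. The Tweepy call is left out because it needs API credentials. The sample lines and the field names `id`, `retweet_count` and `favorite_count` are assumptions for illustration.

```python
import io
import json

# Invented sample of a file holding one JSON object per line.
sample_file = io.StringIO(
    '{"id": 1001, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 1002, "retweet_count": 7, "favorite_count": 31}\n'
)

rows = []
for line in sample_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    tweet = json.loads(line)
    # Keep only the fields of interest, storing the ID as a string.
    rows.append([str(tweet["id"]), tweet["retweet_count"], tweet["favorite_count"]])

columns = ["tweet_id", "retweet_count", "favorite_count"]
print(rows)
# The nested list can then be loaded with pd.DataFrame(rows, columns=columns).
```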

# Assessing Data

### Visually

Each dataset has some form of missing or inaccurate data. Reading just a few lines revealed many NaN values. Categorical data such as dog type was split across 4 different columns. Also, there were 3 datasets that potentially described a single observation.

### Programmatically

Using `df.info()`, `df.describe()` and a few other DataFrame methods, I was able to see columns that had incorrect data types. For example, `tweet_id` was an integer in all the dataframes. Converting it to a string signals that the value is a 'name' rather than a quantity, and it made calling the Twitter oEmbed API easier later on.

The image dataframe and the tweet dataframe were also not the same size. This means that some tweets were either gone or had not been parsed by the machine learning algorithm that produced the image TSV file.
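
As a rough illustration of these checks, the sketch below inspects dtypes and compares the two tables by `tweet_id`. The tiny frames are invented; only the column names `tweet_id` and `rating_denominator` come from the project.

```python
import pandas as pd

# Invented miniature versions of the archive and image tables.
df_archive = pd.DataFrame(
    {"tweet_id": [1001, 1002, 1003], "rating_denominator": [10, 10, 0]}
)
df_image = pd.DataFrame({"tweet_id": [1001, 1003]})

df_archive.info()             # column dtypes and non-null counts
print(df_archive.describe())  # summary statistics for numeric columns

# Tweets present in the archive but absent from the image predictions.
missing = set(df_archive["tweet_id"]) - set(df_image["tweet_id"])
print(len(df_archive), len(df_image), missing)
```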

Reading some values, I was able to document 2 tidiness issues upfront and 9 different cleaning issues.

# Cleaning

Using a variety of techniques, I corrected data types, fixed values, and joined tables. The attached Wrangle Act has further detail on the explicit steps that were needed. The libraries used for cleaning included the following:

- pandas
- regex
- requests
- Tweepy
- json

A note before describing the methods below: I chose to clean the data before tidying the structure. Addressing the structure first would have sped the process up and surfaced fewer errors, but it may not have worked if a different set of tables were to be joined in the future. Tidying the data first would typically be best practice if I knew the total scope of the analysis before cleaning.

### Issues and Methods

#### Issue 1: tweet ID is an integer

Using `astype(str)` I changed the data type of `tweet_id` in all 3 dataframes to a string. Since a tweet ID is more like a name than a number to calculate with, this data type lets it be treated as an identifier.

#### Issue 2: Timestamp not a datetime dtype

Using the `to_datetime()` function, I converted this column to a proper datetime type.
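
Issues 1 and 2 can be sketched together on a small invented frame. The timestamp strings are made up and only assumed to resemble the archive's format.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tweet_id": [1001, 1002],
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    }
)

# Issue 1: treat the ID as a label, not a number.
df["tweet_id"] = df["tweet_id"].astype(str)

# Issue 2: parse the text timestamps into a datetime dtype.
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(df.dtypes)
```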

#### Issue 3: 181 retweets

Retweets were not sourced directly from WeRateDogs, so I removed all of them using the `drop()` method.
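
One way this could look is sketched below. It assumes retweets are marked by a non-null `retweeted_status_id` column, which is a guess about the archive's layout; the rows are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "tweet_id": ["1001", "1002", "1003"],
        # Assumed marker column: filled in only for retweets.
        "retweeted_status_id": [np.nan, 555.0, np.nan],
    }
)

# Collect the index labels of the retweets, then drop those rows.
retweet_index = df[df["retweeted_status_id"].notna()].index
df = df.drop(index=retweet_index)

print(df["tweet_id"].tolist())
```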

#### Issue 4: Invalid denominator in some rows. Each denominator needs to be 10, reduced to 10 or dropped.

##### 4.1: Replace all denominators == 0

Using `df.at[]` I replaced the single 0 denominator with the correct value.

##### 4.2: Fix other non 10 denominators in table

I created a list of indexes for rows whose denominators were not 10. Viewing these rows, I manually reviewed the tweets, found that some had no rating data, and dropped those from the dataframe.

I then created another index list and iterated over it to update each numerator to the appropriate value.

Visually, I could see that the remaining denominators were all divisible by 10. I converted the index values to a list and iterated over them, finding the appropriate rating of n/10 for each row.
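
A minimal sketch of steps 4.1 and 4.2, using invented ratings. It assumes that a denominator divisible by 10 can simply be scaled down to 10, and it skips the manual review described above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rating_numerator": [12, 84, 13, 165],
        "rating_denominator": [10, 70, 0, 150],
    }
)

# 4.1: set a single cell by label with the .at indexer
# (row 2 holds the zero denominator in this made-up frame).
df.at[2, "rating_denominator"] = 10

# 4.2: collect the index labels of rows whose denominator is still not 10.
bad_index = df.index[df["rating_denominator"] != 10].tolist()

for idx in bad_index:
    numerator = df.at[idx, "rating_numerator"]
    denominator = df.at[idx, "rating_denominator"]
    if denominator % 10 == 0:
        # Scale the rating to the equivalent value out of 10.
        df.at[idx, "rating_numerator"] = int(round(numerator * 10 / denominator))
        df.at[idx, "rating_denominator"] = 10

print(df)
```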

#### Issue 5: Invalid Dog Names

Using the `str.lower()` method I created a list of all rows that had a name that was not a proper noun. I added the name None to this list. I ran a loop to check whether the `text` contained 'named' or 'name is'. I used the positions of these phrases in the string to populate names where I could find them, and otherwise replaced the name with an empty string. Using a missing-value marker such as `np.nan` might have been more appropriate, but I was receiving an error that is documented in the code lines.
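
The idea can be sketched with a regular expression in place of the string-index lookup described above. The function name `fix_name`, the pattern and the sample texts are all hypothetical.

```python
import re

# Capture a capitalized word that follows "named" or "name is".
NAME_PATTERN = re.compile(r"(?:named|name is)\s+([A-Z][a-z]+)")


def fix_name(name, text):
    """Keep proper-noun names; otherwise search the text, falling back to ''."""
    if name != "None" and name != name.lower():
        return name  # contains a capital letter, so treat it as a real name
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else ""


print(fix_name("a", "This is a pupper named Wylie. 12/10"))
print(fix_name("None", "His name is Zoey. 13/10"))
print(repr(fix_name("the", "Here is the goodest boy. 11/10")))
```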

#### Issue 6: Invalid Links from vine.co

Vine was a service owned by Twitter that has been shut down. A data archive still exists, but it only contains text, and users are able to delete their content. I removed every line that sourced vine.co to ensure the data being worked with was readily verifiable.
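
A possible version of this filter is sketched below. It assumes the Vine reference sits in a column named `source`; the rows are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tweet_id": ["1001", "1002"],
        "source": [
            '<a href="http://twitter.com/download/iphone">Twitter for iPhone</a>',
            '<a href="http://vine.co">Vine - Make a Scene</a>',
        ],
    }
)

# Plain substring match; na=False treats missing sources as non-Vine.
is_vine = df["source"].str.contains("vine.co", regex=False, na=False)
df = df[~is_vine].reset_index(drop=True)

print(df["tweet_id"].tolist())
```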


#### Issue 7: Invalid numerator entries

I used `describe()` on the `rating_numerator` column to get an overview of the makeup of this column's data. I noticed some odd values and addressed them with manual and programmatic cleaning.

##### 7.1: Novelty Entries

A few novelty entries, such as 666 and other pop-culture references, had slipped into the dataset and were removed individually using `drop()`. I then searched through a few subsets of the data manually and removed many rows that were not dog ratings.

##### 7.2: Decimal values in numerator instead of real values

Once I found that a majority of posts at the rating 7 mark were dogs rather than other types of tweet, I moved on to replacing ratings that had decimal values with rounded integers. For example, in the initial dataframe, a dog rated 11.27/10 was listed as 27/10. I looped through the entire dataframe looking for text that matched the regex `'\d+\.\d\d\/10'`. I created a list of the indexes of the matching rows together with the matched string, in the form `[[index, regex]]`. Using `split('/')`, `int()` and `round()` I extracted the numerator as a rounded integer. I chose a rounded integer because it followed the structure of the rest of the data.

Again looping through the indexes of the rows that matched the regex, I applied the appropriate integer to the numerator at each location.
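
The regex step can be sketched on a few invented tweet texts. This version converts the matched numerator with `float()` before rounding, which is an assumption about how the conversion was done.

```python
import re

# Same pattern as above: digits, a dot, two decimals, then "/10".
DECIMAL_RATING = re.compile(r"\d+\.\d\d/10")

# Invented texts keyed by row index.
texts = {
    0: "This is Logan. 9.75/10 would pet",
    1: "Meet Bella. 12/10",
    2: "Sam earned 11.27/10 today",
}

# Build the [[index, matched_string]] list of rows with a decimal rating.
matches = []
for idx, text in texts.items():
    found = DECIMAL_RATING.search(text)
    if found:
        matches.append([idx, found.group(0)])

# Take the part before the slash and round it to an integer.
fixed = {idx: round(float(rating.split("/")[0])) for idx, rating in matches}

print(matches)
print(fixed)
```

Each entry in `fixed` could then be written back to `rating_numerator` at the matching index.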

#### Issue 8: Some columns may be renamed to improve clarity in df_image

##### 8.1: `img_num` not descriptive

In this section I reviewed what this column meant by assessing `value_counts()` and reviewing manually a few tweets to determine that it described the number of images attached to a tweet. I therefore renamed the column to `num_images`.

##### 8.2: p1-p3_dog are not descriptive

I used a description left by the creator of the files to rename these columns so they are easier to understand. In this section, I also opted to rename `rating_numerator` to `rating_out_of_10` and drop `rating_denominator`. This allowed the dataset to keep its meaning while reducing its overall size.
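
The renames and the dropped column from Issue 8 might be expressed as follows. For brevity the sketch puts the columns in one invented frame, although in the project they live in different tables.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tweet_id": ["1001"],
        "img_num": [1],
        "rating_numerator": [12],
        "rating_denominator": [10],
    }
)

# Rename to more descriptive labels, then drop the now-redundant denominator.
df = df.rename(
    columns={"img_num": "num_images", "rating_numerator": "rating_out_of_10"}
)
df = df.drop(columns=["rating_denominator"])

print(list(df.columns))
```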


#### Issue 9: Dog breeds can be standardized

In the `#_guess` columns, I used `str.lower()` to enforce a lowercase string on all dog breed strings. This would allow easier grouping.
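
A small sketch of why lowercasing helps grouping. The column name `1_guess` and the breed strings are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"1_guess": ["Golden_Retriever", "golden_retriever", "Pug"]})

# Lowercase every breed so differently cased spellings group together.
df["1_guess"] = df["1_guess"].str.lower()
counts = df["1_guess"].value_counts()

print(counts)
```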

### Tidiness 1: Join all tables

Using inner joins with `DataFrame.merge()`, the 3 datasets were combined. This created the overall smaller table mentioned in the note at the start of the Cleaning section. Doing this step first would have saved time during cleaning but might have made future analysis more difficult if new tables were added.
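
A minimal sketch of the chained inner joins, assuming all three tables share a string `tweet_id` key. The frame names and rows are invented.

```python
import pandas as pd

df_archive = pd.DataFrame(
    {"tweet_id": ["1001", "1002", "1003"], "rating_out_of_10": [12, 13, 11]}
)
df_image = pd.DataFrame({"tweet_id": ["1001", "1003"], "num_images": [1, 2]})
df_api = pd.DataFrame(
    {"tweet_id": ["1001", "1002", "1003"], "retweet_count": [5, 7, 9]}
)

# Inner joins keep only the tweets that appear in every table.
df_master = df_archive.merge(df_image, on="tweet_id", how="inner").merge(
    df_api, on="tweet_id", how="inner"
)

print(df_master)
```

Because the joins are inner joins, tweet 1002 drops out: it has no row in the image table.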


### Tidiness 2: Create a `type` column to store the doggo type

The `doggo` type was stored across 4 columns. I created a new `type` column and iterated over the entire dataset to append the strings from the 4 single-type columns.
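
One way to collapse the four columns is sketched below with a row-wise `apply` rather than an explicit loop. Apart from `doggo`, the column names and the `"None"` placeholder are assumptions about the archive's layout.

```python
import pandas as pd

# Assumed names of the four single-type columns.
type_columns = ["doggo", "floofer", "pupper", "puppo"]

df = pd.DataFrame(
    {
        "doggo": ["doggo", "None", "None"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "pupper", "None"],
        "puppo": ["None", "None", "None"],
    }
)


def combine_types(row):
    """Join every type that is set on the row; empty string if none are."""
    found = [row[col] for col in type_columns if row[col] != "None"]
    return ",".join(found)


df["type"] = df.apply(combine_types, axis=1)
df = df.drop(columns=type_columns)

print(df)
```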


# Storing Data

The new cleaned and combined dataframe was stored to the CSV specified in the instructions. This CSV was then reloaded into a new dataframe. While more memory intensive, this copy ensures that changes do not affect the source dataframe.

One note is that `tweet_id` reverted to an integer when it was written to CSV and read back, so I manually reset the data type to string using `astype()`.
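
The round trip and the dtype reset can be sketched with an in-memory buffer standing in for the CSV file. The rows are invented. The `dtype` argument shown at the end is an alternative to `astype()`, not the method used in the project.

```python
import io

import pandas as pd

df = pd.DataFrame({"tweet_id": ["1001", "1002"], "rating_out_of_10": [12, 13]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reload: pandas infers a numeric dtype for the all-digit IDs.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["tweet_id"].dtype)

# Reset the IDs to strings after loading.
reloaded["tweet_id"] = reloaded["tweet_id"].astype(str)

# Alternative: declare the dtype while reading.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"tweet_id": str})
```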

# Visualization
Using matplotlib, I did some small visual analysis to determine the relationship between Charlie dog ratings and the mean. I also reviewed how strong the relationship between retweets and likes was in a linear regression model. Both of these are plotted in the file.
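
The regression part can be sketched with NumPy alone. The retweet and like counts below are invented, and `np.polyfit` is one possible way to fit the line, not necessarily the one used in the notebook.

```python
import numpy as np

# Invented retweet and like counts for five tweets.
retweets = np.array([10, 20, 30, 40, 50])
likes = np.array([32, 61, 88, 123, 150])

# Least-squares straight line: likes = slope * retweets + intercept.
slope, intercept = np.polyfit(retweets, likes, deg=1)

# The Pearson correlation coefficient measures the strength of the linear relationship.
r = np.corrcoef(retweets, likes)[0, 1]

print(round(slope, 2), round(intercept, 2), round(r, 3))
```

The fitted line can then be drawn over a scatter plot of the two columns with matplotlib.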
