# Wrangling Report

### June 28, 2022.

## Data Gathering
I gathered **all** three pieces of data for this project and loaded them in the notebook.

### Read the first dataset (the enhanced Twitter archive, a CSV file in the project directory) into a DataFrame.
* df_arc = pd.read_csv('twitter-archive-enhanced.csv')

### Download the image prediction data (keyed by tweet_id) from the provided URL

* I used the Requests library to download the data, saved it as image_predictions.tsv, and then loaded it into a DataFrame:
* df_img = pd.read_csv('image_predictions.tsv', sep='\t')

###  Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
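
Once the API results are stored in tweet_json.txt, the file can be read back into a DataFrame. This is a minimal sketch, assuming the file holds one JSON object per line and that each object carries the fields id_str, retweet_count, and favorite_count used later in this report. The two sample lines are made up, and an in-memory buffer stands in for the real file.

```python
import io
import json

import pandas as pd

# Two made-up lines standing in for tweet_json.txt (assumed: one JSON object per line).
sample_file = io.StringIO(
    '{"id_str": "1001", "retweet_count": 5, "favorite_count": 20}\n'
    '{"id_str": "1002", "retweet_count": 0, "favorite_count": 3}\n'
)

rows = []
for line in sample_file:  # with the real file: open('tweet_json.txt')
    tweet = json.loads(line)
    # Keep only the fields needed for the analysis.
    rows.append({
        'id_str': tweet['id_str'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

df_twt = pd.DataFrame(rows, columns=['id_str', 'retweet_count', 'favorite_count'])
```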

### Checked the three DataFrames

* df_arc.head()
> first rows of the archived Twitter data
* df_img.head()
> first rows of the image prediction data
* df_twt.head()
> first rows of the data from the Twitter API


#### Quality
###### Archive (df_arc) Table

* Missing values in columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_timestamp, and expanded_urls.
* Column name floofer should be spelled 'floof'

###### Image_pred (df_img) Table
* The dog types in columns p1, p2, and p3 mix uppercase and lowercase letters.

#### Tidiness
* The text column holds multiple variables, such as a URL link and the rating, and some tweets describe two dogs.
* The tweet_count and archive tables should be merged because they hold related data.

### Assessment Summary

#### Quality Issues

#### Archive Table (df_arc)
1. Missing values in columns: __in_reply_to_status_id__, __in_reply_to_user_id__, __retweeted_status_id__, __retweeted_status_user_id__, __retweeted_status_timestamp__, and __expanded_urls__.
2. Column name __floofer__ should be spelled __'floof'__ (but entire column values can be left as floofer)
3. __tweet_id__ has dtype int64 and should be object
4. __timestamp__ should have dtype datetime64.
5. Missing information for dog stages.
6. Many names are missing from the __name__ column, and odd names like __'a'__ and __'an'__ are probably ordinary words that were extracted from the text out of context.
7. Retweets and replies should be removed from the table, keeping only original tweets.
8. Some tweets contain __"\&amp;"__, the HTML code that displays an ampersand, so that needs to be cleaned up.
9. Some records have more than one dog stage.
10. The __rating_numerator__ column has values less than 10 (e.g. 0, 7) as well as some very large numbers (e.g. 130, 110).
11. The __rating_denominator__ column has values other than 10 (some much higher than 10).
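
Several of the archive-table issues listed above can be surfaced programmatically. The sketch below runs on a toy stand-in for df_arc with invented values. The cut-off of 20 for an "unusually large" numerator is my own assumption for illustration, not a rule from the data.

```python
import pandas as pd

# Toy stand-in for df_arc; all values are invented for illustration.
df_arc = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [None, 10.0, None, None],
    'name': ['Bella', 'a', 'None', 'an'],
    'rating_numerator': [13, 7, 130, 12],
    'rating_denominator': [10, 10, 10, 50],
})

# Missing values per column.
missing_per_column = df_arc.isnull().sum()

# All-lowercase "names" such as 'a' and 'an' are likely misextracted words.
lowercase_names = df_arc[df_arc['name'].str.islower()]

# Denominators other than 10, and numerators outside an assumed 10..20 range.
odd_denominators = df_arc.query('rating_denominator != 10')
low_or_high = df_arc.query('rating_numerator < 10 or rating_numerator > 20')
```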

#### Image prediction table (df_img)
1. The types of dogs in columns __p1__, __p2__, and __p3__ had some uppercase and lowercase letters.
2. The __tweet_id__ column should be dtype object instead of int64.

#### Tweet count table (df_twt)
1. The column __id_str__ should be changed to __tweet_id__ so merging tables will be smoother.

#### Tidiness Issues
1. The __tweet_count (df_twt)__ data (retweet_count and favorite_count) should be merged into the __twitter-archive-enhanced (df_arc)__ table because it supplements the __twitter-archive-enhanced__ table.
2. The __source__ column in the twitter-archive-enhanced table looks messy and clutters the table.
3. __doggo__, __floofer__, __pupper__, __puppo__ columns in __twitter_archive_enhanced__ table should be in one column named __Stage__.
4.  __df_arc__ without any duplicates (i.e. retweets) will have empty __retweeted_status_id__, __retweeted_status_user_id__ and __retweeted_status_timestamp__ columns, which can be dropped.
5. __"Breed"__ column should be added in __df_arc__ table; its values based on __p1_conf__ and __p1_dog__ columns of __df_img__ (image predictions) table
6. All three tables will eventually be merged into one.

## Cleaning Data
In this section, I cleaned **all** the issues I documented while assessing. 

Before cleaning, I made a copy of each table: df_arc, df_img, and df_twt became arc_cl, img_cl, and twt_cl.
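
As a small illustration of why the copies matter, the sketch below (toy data, invented values) shows that edits made to a copy created with copy() leave the original table untouched.

```python
import pandas as pd

# Toy stand-in for df_arc; values are invented.
df_arc = pd.DataFrame({'tweet_id': [1, 2], 'text': ['first', 'second']})

arc_cl = df_arc.copy()             # independent copy of the data
arc_cl.loc[0, 'text'] = 'changed'  # this edit does not reach df_arc
```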

### Issue #1: Some columns have missing values, and some columns are not necessary for my future analysis.

#### Define
Delete retweets and observations without an ID, then delete the columns with missing values and the other unnecessary columns: __'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'in_reply_to_status_id', 'source', 'expanded_urls', 'in_reply_to_user_id'.__
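
A minimal sketch of this step on toy data (all values invented). It assumes that a retweet is a row with a non-null retweeted_status_id and a reply is a row with a non-null in_reply_to_status_id, so original tweets are the rows where both are missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for arc_cl: row 1 is original, row 2 a reply, row 3 a retweet.
arc_cl = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [np.nan, 55.0, np.nan],
    'in_reply_to_user_id': [np.nan, 77.0, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 99.0],
    'retweeted_status_user_id': [np.nan, np.nan, 88.0],
    'retweeted_status_timestamp': [None, None, '2017-01-01 00:00:00 +0000'],
    'source': ['web', 'web', 'web'],
    'expanded_urls': ['u1', None, 'u3'],
    'text': ['good dog', 'reply text', 'RT good dog'],
})

# Keep only original tweets (neither a retweet nor a reply).
is_original = (arc_cl['retweeted_status_id'].isnull()
               & arc_cl['in_reply_to_status_id'].isnull())
arc_cl = arc_cl[is_original]

# Drop the columns that are no longer needed.
unneeded = ['retweeted_status_id', 'retweeted_status_user_id',
            'retweeted_status_timestamp', 'in_reply_to_status_id',
            'source', 'expanded_urls', 'in_reply_to_user_id']
arc_cl = arc_cl.drop(columns=unneeded)
```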

### Issue #2: Some ratings do not have images; I only want ratings with images.

#### Define
I would delete observations without images.
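
One way to sketch this, assuming a tweet "has an image" when its tweet_id appears in the image predictions table. The data is invented, and jpg_url is only a placeholder column here.

```python
import pandas as pd

# Toy stand-ins; values are invented.
arc_cl = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['x', 'y', 'z']})
img_cl = pd.DataFrame({'tweet_id': [1, 3], 'jpg_url': ['u1', 'u3']})

# Keep only tweets whose ID also appears in the image table.
has_image = arc_cl['tweet_id'].isin(img_cl['tweet_id'])
arc_cl = arc_cl[has_image].reset_index(drop=True)
```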

### Issue #3: Fix some column names
* twt_cl: unify column names
* arc_cl: fix column names

#### Define
* In the twt_cl table, the column name id_str would be changed to tweet_id using the rename() function.
* In the arc_cl table, column name floofer should be "floof" to match the dog stage associated with it using the rename() function.
* The columns rating_numerator and rating_denominator should be shortened to "rate_num" and "rate_denom" to make them less wordy.
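
The three renames above can be sketched with rename() on toy tables (values invented):

```python
import pandas as pd

# Toy stand-ins; values are invented.
twt_cl = pd.DataFrame({'id_str': ['1'], 'retweet_count': [5]})
arc_cl = pd.DataFrame({'tweet_id': [1], 'floofer': ['None'],
                       'rating_numerator': [12], 'rating_denominator': [10]})

# rename() returns a new DataFrame, so the result is assigned back.
twt_cl = twt_cl.rename(columns={'id_str': 'tweet_id'})
arc_cl = arc_cl.rename(columns={
    'floofer': 'floof',
    'rating_numerator': 'rate_num',
    'rating_denominator': 'rate_denom',
})
```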

### Issue #4: Fixing Datatypes.
* img_cl: tweet_id dtype "string"
* arc_cl: timestamp dtype "datetime"
* arc_cl: tweet_id dtype "string"

#### Define
* In the img_cl table, I would change the dtype of the tweet_id column from int64 to object using the astype() function.
* In the arc_cl table, I would change the dtype of the timestamp column from object to datetime using pandas to_datetime() function.
* In the arc_cl table, I would change the dtype of the tweet_id column from int64 to object using the astype() function.
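
A short sketch of these conversions on toy data. The ID and timestamp below are invented, and the timestamp format is an assumption about what the archive contains.

```python
import pandas as pd

# Toy stand-ins; the ID and timestamp are invented.
arc_cl = pd.DataFrame({'tweet_id': [100000000000000001],
                       'timestamp': ['2017-08-01 16:23:56 +0000']})
img_cl = pd.DataFrame({'tweet_id': [100000000000000001]})

# IDs are labels, not quantities, so store them as strings.
arc_cl['tweet_id'] = arc_cl['tweet_id'].astype(str)
img_cl['tweet_id'] = img_cl['tweet_id'].astype(str)

# Parse the timestamp text into real datetime values.
arc_cl['timestamp'] = pd.to_datetime(arc_cl['timestamp'])
```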

### Issue #5: Make the dog breed names uniform
* img_cl DataFrame

#### Define
* In the img_cl table, all the dog breed names in the __p1, p2, and p3__ columns would be converted to lowercase letters.
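
A minimal sketch with the pandas string accessor, using invented breed values:

```python
import pandas as pd

# Toy stand-in for img_cl; breed values are invented examples.
img_cl = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'redbone'],
    'p2': ['collie', 'Miniature_pinscher'],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback'],
})

# Lowercase every prediction column.
for col in ['p1', 'p2', 'p3']:
    img_cl[col] = img_cl[col].str.lower()
```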

### Issue #6: Clean up text column in arc_cl dataframe

#### Define
* In the arc_cl table, I would change the HTML ampersand code "\&amp;" to "&" in the text column,
* I would also remove the newline symbol "\n",
* and also remove the ending URL link.
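
These three text fixes can be sketched as chained str.replace() calls. The sample tweets are invented. Replacing the newline with a space (rather than deleting it) is my own choice so that words do not run together, and the URL pattern is an assumption about how the trailing links look.

```python
import pandas as pd

# Toy stand-in for arc_cl; tweet texts are invented.
arc_cl = pd.DataFrame({'text': [
    'This is Max &amp; Bo.\nBoth 12/10 https://t.co/abc123',
    'Here is a pupper 11/10',
]})

arc_cl['text'] = (
    arc_cl['text']
    .str.replace('&amp;', '&', regex=False)            # HTML entity -> plain ampersand
    .str.replace('\n', ' ', regex=False)               # newline -> single space
    .str.replace(r'\s*https?://\S+$', '', regex=True)  # strip a trailing URL
)
```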

### Issue #7: Fix the ratings columns in the arc_cl table

#### Define
* In the arc_cl table, I would use methods such as extractall(), query(), and contains() to check for misextracted ratings.
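
The sketch below shows how those three methods could flag suspicious ratings. The tweets, the stored ratings, and the regular expression are all assumptions made for illustration; the original column names rating_numerator and rating_denominator are used here.

```python
import pandas as pd

# Toy stand-in for arc_cl; texts and stored ratings are invented.
arc_cl = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'text': ['Meet Bella. 13.5/10 would pet',
             'Solid 12/10',
             'This 24/7 good boy is 14/10'],
    'rating_numerator': [5, 12, 24],
    'rating_denominator': [10, 10, 7],
})

# contains(): tweets whose rating has a decimal numerator.
decimal_ratings = arc_cl[arc_cl['text'].str.contains(r'\d+\.\d+/\d+')]

# query(): stored denominators other than 10.
odd_denominator = arc_cl.query('rating_denominator != 10')

# extractall(): every "number/number" pattern in each tweet.
found = arc_cl['text'].str.extractall(r'(\d+(?:\.\d+)?)/(\d+)')
found.columns = ['num', 'denom']

# Compare the first pattern found in the text with the stored numerator.
first = found.xs(0, level='match').astype(float)
mismatch = arc_cl[first['num'].reindex(arc_cl.index) != arc_cl['rating_numerator']]
```

Rows flagged by any of the three checks would then be read manually before correcting or dropping them.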

### Issue #8: Remove data with double ratings

#### Define
* In the arc_cl table, some tweets rate two dogs; those rows will be dropped because they violate the rules of tidiness.
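
A possible sketch of this step, assuming a tweet with more than one "number/number" pattern in its text is a double rating. The pattern and the sample tweets are invented for illustration, and a pattern like this can also match dates or phrases such as "24/7", so flagged rows are worth a manual look first.

```python
import pandas as pd

# Toy stand-in for arc_cl; texts are invented.
arc_cl = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'text': ['Meet Ace and Bo. 10/10 and 7/10', 'Solid 12/10'],
})

# Count rating-like patterns per tweet and keep rows with at most one.
rating_counts = arc_cl['text'].str.count(r'\d+/\d+')
arc_cl = arc_cl[rating_counts <= 1].reset_index(drop=True)
```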

### Issue #9: Some records have more than one dog stage

#### Define

* There is one record that has both doggo and floof
* There is another record that has both doggo and puppo. 
> For these 2 records, I would take a look at the text manually to decide one dog stage for each of them. If I find ambiguous texts, I would set both the column values as None or drop the rows.

* There are 10 records which have both doggo and pupper. I would also decide this from reading the text manually and setting the appropriate stage programmatically.

##### Drop rows with double dog stages
* From my visual / manual investigation, these rows appear to describe two different dogs.

#### Also drop rows whose name is __'a'__
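
Dropping the double-stage rows could look like the sketch below. It assumes a missing stage is stored as the string 'None' and that the floofer column has already been renamed to floof; the data is invented.

```python
import pandas as pd

stage_cols = ['doggo', 'floof', 'pupper', 'puppo']

# Toy stand-in for arc_cl; row 3 carries two stages.
arc_cl = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', 'None', 'doggo'],
    'floof': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})

# Count how many stage columns are filled in for each row.
n_stages = (arc_cl[stage_cols] != 'None').sum(axis=1)

# Keep rows with zero or one stage.
arc_cl = arc_cl[n_stages <= 1].reset_index(drop=True)
```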
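
A minimal sketch of this filter on invented data:

```python
import pandas as pd

# Toy stand-in for arc_cl; names are invented.
arc_cl = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                       'name': ['Bella', 'a', 'Max']})

# Keep every row whose name is not the stray word 'a'.
arc_cl = arc_cl[arc_cl['name'] != 'a'].reset_index(drop=True)
```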

### Issue #10: MERGE

#### Define
* Take both the arc_cl and twt_cl tables and merge them into one table using the join() method on the tweet_id column.
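
Because join() matches on the index, one way to sketch this is to set tweet_id as the index of both tables first. The data below is invented, and the left join is an assumption about the join type. Both tweet_id columns need the same dtype (strings here) for the keys to match.

```python
import pandas as pd

# Toy stand-ins; values are invented. Tweet '3' has no API data.
arc_cl = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rate_num': [12, 13, 11]})
twt_cl = pd.DataFrame({'tweet_id': ['1', '2'],
                       'retweet_count': [5, 0],
                       'favorite_count': [20, 3]})

tweet_data = (
    arc_cl.set_index('tweet_id')
    .join(twt_cl.set_index('tweet_id'), how='left')  # keep every archive row
    .reset_index()
)
```

The same pattern would apply when joining the result with img_cl. Rows without a match end up with missing values in the joined columns.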

### Issue #11: FINAL MERGE

#### Define
* Take the newly created tweet_data table and combine it with the img_cl table using the same join() method on the tweet_id column.

### Issue #12: Removing missing Data

#### Define
* Remove the rows with missing values from the merged table using the drop() method.
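
A sketch of this step with drop(), on invented data. It assumes that a row with a missing value in any column should be removed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged table; values are invented.
tweet_data = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'retweet_count': [5.0, np.nan, 2.0],
    'jpg_url': ['u1', 'u2', None],
})

# Find the index labels of rows with at least one missing value, then drop them.
rows_with_missing = tweet_data[tweet_data.isnull().any(axis=1)].index
tweet_data = tweet_data.drop(rows_with_missing).reset_index(drop=True)
```

Calling tweet_data.dropna() would give the same rows in one step.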

## Storing Data
* Here I saved the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv"
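
A minimal sketch of the save step. The table contents are invented, and a temporary directory is used so the example does not overwrite a real file. Passing index=False keeps the row index out of the CSV.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master table; values are invented.
tweet_data = pd.DataFrame({'tweet_id': ['1', '2'], 'rate_num': [12, 13]})

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'twitter_archive_master.csv')

tweet_data.to_csv(out_path, index=False)

# Read it back, keeping tweet_id as a string rather than an integer.
reloaded = pd.read_csv(out_path, dtype={'tweet_id': str})
```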