## WeRateDogs Tweet Project Wrangle Report

The data used in this project comes from WeRateDogs' tweets (@dog_rates) posted between Nov 15, 2015 and Aug 1, 2017.

## Gathering
Our data comes from three different sources. Below are the sources and how I gathered data from each of them.

> **Note**  
The project was conducted using Jupyter Notebook and pandas version 1.4.2.

### 1. WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

* A `csv` file from the @dog_rates Twitter archive was provided.
* It was downloaded manually and loaded into a pandas dataframe using `pandas.read_csv`.
* The file contained tweet data such as the tweet id, the tweet text, URL, and timestamp.

### 2. Tweet image prediction (image_predictions.tsv)

* A URL to a `tsv` file was provided.
* The file contained data on dog breed predictions based on the images attached to each tweet.
* The file was accessed using the `requests` library and its contents were written into a `tsv` file.
* The file was loaded into `pandas` using `pandas.read_csv` with a tab (`\t`) as the delimiter.
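
The download and load steps above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: the helper names `download_file` and `read_predictions` are made up for illustration, and the URL in the usage comment is a placeholder rather than the one provided for the project.

```python
import pandas as pd
import requests


def download_file(url: str, path: str) -> None:
    """Fetch `url` and write the raw response bytes to `path`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    with open(path, "wb") as f:
        f.write(response.content)


def read_predictions(source) -> pd.DataFrame:
    """Load a tab-separated file (a path or file-like object) into a dataframe."""
    return pd.read_csv(source, sep="\t")


# Usage (placeholder URL, not the real one):
# download_file("https://example.com/image_predictions.tsv", "image_predictions.tsv")
# image_predictions = read_predictions("image_predictions.tsv")
```

Writing the response bytes unchanged keeps the file on disk identical to the remote copy, so it can be reloaded later without another request.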

### 3. Extra tweet data via Tweepy (tweet_json.txt)

* More information on the number of likes, quotes, replies, and retweets was needed.
* It was extracted using a Twitter Developer account and the Tweepy library.
* The tweet ids in our first file and the tweet field `public_metrics` were used to obtain it.
* The extracted data was stored in a txt file and later converted into a dataframe using `pd.json_normalize()`.
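
The conversion step can be illustrated as below. This is only a sketch: the two lines are made-up stand-ins for the contents of `tweet_json.txt`, and the one-JSON-object-per-line layout is an assumption about how the file was written.

```python
import json

import pandas as pd

# Made-up lines standing in for tweet_json.txt (assumed: one JSON object per line).
lines = [
    '{"id": "1", "public_metrics": {"retweet_count": 5, "reply_count": 1, "like_count": 20, "quote_count": 0}}',
    '{"id": "2", "public_metrics": {"retweet_count": 9, "reply_count": 3, "like_count": 41, "quote_count": 2}}',
]

# Parse each line, then flatten the nested public_metrics dict into columns.
records = [json.loads(line) for line in lines]
extra_tweet_data = pd.json_normalize(records)

print(extra_tweet_data.columns.tolist())
```

`pd.json_normalize()` names flattened columns after their parent key, so the metrics come out as `public_metrics.like_count`, `public_metrics.retweet_count`, and so on.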


## Assessing and Cleaning
Both visual and programmatic assessments were conducted.  
A number of issues were identified during the assessing stage.   
Below are the issues identified and how they were fixed.   
Note that copies of the data were made before cleaning was conducted.


### Quality issues

1. **Retweets in the data** -> **Delete the retweets**
* We only needed original tweets, so the retweet rows and the retweet-related columns were deleted.
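
A minimal sketch of this fix on toy data is shown below. The column names `retweeted_status_id` and `retweeted_status_user_id` are assumptions about the archive's layout, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Toy archive: the second row is a retweet (its retweeted_status_* fields are filled).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["original tweet", "RT @dog_rates: ...", "another original"],
    "retweeted_status_id": [np.nan, 111.0, np.nan],
    "retweeted_status_user_id": [np.nan, 222.0, np.nan],
})

archive_clean = archive.copy()  # clean a copy, keep the raw data intact

# Keep only rows that are not retweets.
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isna()]

# Drop the retweet-related columns, which are now entirely empty.
retweet_cols = [c for c in archive_clean.columns if c.startswith("retweeted_status")]
archive_clean = archive_clean.drop(columns=retweet_cols).reset_index(drop=True)
```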

2. **Rows with missing Data** -> **Delete the rows**
* Tweets that had been deleted yielded rows full of `NaN` in the last dataframe. These rows were dropped.

3. **Invalid data** -> **Extract correct data**
* `very` and other non-names appeared as dog names, and the string `None` was used as a value representing unavailable data.
* Incorrect ratings were also present.
* The correct data was extracted using regex and `None` strings were replaced with `np.nan`.
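
The sketch below shows one way such an extraction could look. The regex patterns are assumptions chosen for illustration (a capitalised word after "This is", "Meet" or "named" for names, and `numerator/denominator` with an optional decimal part for ratings); the tweets are made up.

```python
import numpy as np
import pandas as pd

tweets = pd.DataFrame({
    "text": [
        "This is Bella. She is a very good girl. 13.5/10",
        "This is very fluffy. 12/10 would pet",
        "Here we have a pupper named Rex. 11/10",
    ],
    "name": ["Bella", "very", "None"],
})

# Replace the "None" placeholder strings with a real missing value.
tweets["name"] = tweets["name"].replace("None", np.nan)

# Re-extract names; rows without a capitalised name become NaN.
name_pattern = r"(?:This is|Meet|named)\s+([A-Z][a-z]+)"
tweets["name"] = tweets["text"].str.extract(name_pattern, expand=False)

# Re-extract ratings, allowing decimal numerators such as 13.5/10.
ratings = tweets["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)
tweets["rating_numerator"] = ratings[0]
tweets["rating_denominator"] = ratings[1]
```

Requiring a capital letter is what filters out words like `very`; it would also reject any genuinely lowercase name, so the pattern is a trade-off rather than a complete solution.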

4. **Wrong data types** -> **Change to correct dtypes** 
* `astype()` was used to handle most conversions
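
As an illustration, conversions of this kind might look like the sketch below. The columns and target dtypes are assumptions, and `pd.to_datetime` is shown as one possible way to handle a conversion that `astype()` was not used for.

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2015-11-15 22:32:08 +0000", "2017-08-01 16:23:56 +0000"],
    "stage": ["pupper", "doggo"],
})

# Ids are identifiers, not quantities, so store them as strings;
# a small set of repeated labels fits the category dtype.
df = df.astype({"tweet_id": str, "stage": "category"})

# Timestamps are parsed separately into timezone-aware datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])
```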

5. **`Noise` in `source` column** -> **Useful data was extracted**
* Useful data was buried among a lot of irrelevant information.  
  The useful data was extracted from the `noise`.
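
A minimal sketch of this kind of extraction follows, assuming the `source` values are HTML anchor tags whose link text names the client. The sample values are illustrative.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between the opening and closing anchor tags.
source_clean = source.str.extract(r">([^<]+)<", expand=False)
```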

6. **Long column names** -> **Shorten the names** 
* Unnecessary prefixes were removed using `.replace()`
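
For example, assuming the prefix in question is `public_metrics.` (the form `pd.json_normalize()` gives to flattened nested fields), the renaming could be sketched as:

```python
import pandas as pd

# Empty dataframe with illustrative column names.
df = pd.DataFrame(columns=["id", "public_metrics.like_count", "public_metrics.retweet_count"])

# regex=False treats the dot as a literal character rather than a wildcard.
df.columns = df.columns.str.replace("public_metrics.", "", regex=False)
```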

7. **Inconsistent column names** -> **Rename Columns**
* Columns identifying tweet ids in different dataframes had varying names.  
  One of them was renamed to match the other. 


### Tidiness issues

1. **Same variable in separate columns** -> **Merged into one column** 
* The dog stages, reply status, and best dog predictions were spread over a number of columns.
* The dog stages were re-extracted using regex and put into one column, and the four separate columns were deleted.
* The best prediction of an actual dog breed was extracted into a single column.
* A new column was created that gives the reply status, i.e. whether a tweet is a reply or not, and the two redundant columns were deleted.
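
The three fixes above can be sketched on toy data as follows. All column names here (`doggo`, `floofer`, `pupper`, `puppo`, `in_reply_to_status_id`, `in_reply_to_user_id`, `p1`/`p1_dog` and so on) are assumptions about the dataframes' layout, and the rows are made up.

```python
import numpy as np
import pandas as pd

tweets = pd.DataFrame({
    "text": ["Meet Rex, a happy pupper", "Such a floofer", "Just a good dog"],
    "doggo": ["None", "None", "None"],
    "floofer": ["None", "floofer", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "None", "None"],
    "in_reply_to_status_id": [np.nan, 123.0, np.nan],
    "in_reply_to_user_id": [np.nan, 456.0, np.nan],
})

# One stage column: take the first stage word found in the text (NaN if none).
stage_pattern = r"\b(doggo|floofer|pupper|puppo)\b"
tweets["stage"] = tweets["text"].str.lower().str.extract(stage_pattern, expand=False)
tweets = tweets.drop(columns=["doggo", "floofer", "pupper", "puppo"])

# One reply-status column: a tweet is a reply if it points at another status.
tweets["is_reply"] = tweets["in_reply_to_status_id"].notna()
tweets = tweets.drop(columns=["in_reply_to_status_id", "in_reply_to_user_id"])

preds = pd.DataFrame({
    "p1": ["pug", "seat_belt", "teapot"], "p1_dog": [True, False, False],
    "p2": ["beagle", "golden_retriever", "spoon"], "p2_dog": [True, True, False],
    "p3": ["boxer", "pug", "cup"], "p3_dog": [True, True, False],
})

# Best breed: the highest-ranked prediction that is actually a dog (NaN if none).
preds["breed"] = (
    preds["p1"].where(preds["p1_dog"])
    .fillna(preds["p2"].where(preds["p2_dog"]))
    .fillna(preds["p3"].where(preds["p3_dog"]))
)
```

Note that this stage pattern keeps only the first stage mentioned, so a tweet naming two stages would need extra handling.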


2. **Same observation in separate dataframes** -> **Merged dataframes** 
* `we_rate_dogs` and `extra_tweet_data` dataframes contained related info
* The two dataframes were merged
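
A merge of this kind could look like the sketch below. The shared key `tweet_id` and the inner join are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

we_rate_dogs = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "name": ["Bella", "Rex", "Milo"],
})
extra_tweet_data = pd.DataFrame({
    "tweet_id": ["1", "3"],
    "like_count": [20, 7],
})

# An inner join keeps only tweets present in both dataframes.
merged = we_rate_dogs.merge(extra_tweet_data, on="tweet_id", how="inner")
```

A left join (`how="left"`) would instead keep every archive tweet and leave the metrics as `NaN` where they are missing.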