## Data Gathering
I gathered data from three sources. I manually downloaded the Twitter archive data, `twitter-archive-enhanced.csv`.
I programmatically downloaded the image prediction data, `image-predictions.tsv`, using the Python Requests library.
I used the Twitter API to gather each tweet's `favorite count` and `retweet count`.
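
As an illustration of the programmatic download, here is a minimal sketch using Requests. The helper name `download_file` and the example URL are assumptions made for illustration; the actual URL of the file is not given in this report.

```python
import requests


def download_file(url, dest_path, timeout=30):
    """Download `url` and write the response body to `dest_path`."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on HTTP failures (4xx/5xx)
    with open(dest_path, "wb") as fh:
        fh.write(response.content)
    return dest_path


# Hypothetical usage (placeholder URL, not the real one):
# download_file("https://example.com/image-predictions.tsv", "image-predictions.tsv")
```

Writing the raw bytes in binary mode keeps the file identical to what the server sent, so it can be read later with `pd.read_csv(..., sep='\t')`.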


The `twitter-archive-enhanced.csv` file contains tweets from the `WeRateDogs` account from 2015 to 2017. Here is a preview of the data:

In [3]:
import pandas as pd
pd.read_csv('twitter-archive-enhanced.csv').head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


The `image-predictions.tsv` file contains information about the images in each tweet. For every image, it lists a neural network's three predictions of what the image shows (`p1`, `p2`, `p3`), the confidence score of each prediction, and whether each prediction is a dog breed. Here is a preview of the dataset.

In [4]:
pd.read_csv('image-predictions.tsv', sep='\t').head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


I gathered data about the favorite count and retweet count from the Twitter API. I loaded this data into a Pandas DataFrame. Here is a preview of the data.

In [9]:
pd.read_csv('tweets_api_data.csv').head(3)

Unnamed: 0.1,Unnamed: 0,favorite_count,retweet_count,tweet_id
0,0,39467,8853,892420643555336193
1,1,33819,6514,892177421306343426
2,2,25461,4328,891815181378084864
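
One way to turn API responses into a table like the one above is sketched below. It assumes each tweet's JSON was kept as one JSON string per line, which is an assumption and not necessarily how the data was stored here; the function name `tweets_to_frame` is hypothetical. The sample values are taken from the preview above.

```python
import json

import pandas as pd


def tweets_to_frame(json_lines):
    """Build a DataFrame of counts from an iterable of tweet JSON strings."""
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "favorite_count", "retweet_count"])


# Two sample records, using values from the preview above.
sample = [
    '{"id": 892420643555336193, "favorite_count": 39467, "retweet_count": 8853}',
    '{"id": 892177421306343426, "favorite_count": 33819, "retweet_count": 6514}',
]
df_api = tweets_to_frame(sample)
print(df_api)
```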


## Assessing

### Quality issues

Here are some quality issues I noticed in the data via visual and programmatic assessment.

**twitter_archive data**

1. Missing values in `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `expanded_urls` and `retweeted_status_timestamp` columns.

2. `tweet_id` is an `integer` instead of an `object`. The variables `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, and `retweeted_status_user_id` are of type `float` instead of `object`.

3. There are some `None` values in the `name` variable.

4. The `timestamp` variable is of the *object* datatype instead of *datetime*.

5. Retweeted data is included in the dataset.

6. The values `a`, `an`, `the`, and `just` in the `name` column are not dog names.

7. HTML formatting in the `source` column.


**image_prediction data**

8. `tweet_id` is an integer instead of an object.

9. Not all animals in the dataset are dogs. Some are hens, snails, etc.

10. Underscores (`_`) in dog breed names, and inconsistent capitalization of breed names.
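
The programmatic part of the assessment can be sketched with a few pandas calls. The small DataFrame below is made-up stand-in data with the same column names as the archive, used only so the example runs on its own.

```python
import pandas as pd

# Made-up stand-in rows; only the column names match the real archive.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "retweeted_status_id": [float("nan"), 9.0, float("nan"), float("nan")],
    "name": ["Phineas", "a", "None", "the"],
})

# Data types: tweet_id is numeric and timestamp is stored as text.
print(archive.dtypes)

# Number of missing values in each column.
print(archive.isnull().sum())

# Retweets are the rows with a non-null retweeted_status_id.
n_retweets = int(archive["retweeted_status_id"].notnull().sum())
print(n_retweets)

# Entries in `name` that are entirely lowercase are likely ordinary words, not names.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].unique()
print(sorted(lowercase_names))
```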


### Tidiness issues

1. In the `image prediction data` DataFrame, `p1_dog`, `p2_dog`, and `p3_dog` contain the same kind of information: they tell us whether the object in the image is *a dog*. There should be just one column that tells us the breed of dog in the image, based on the confidence scores of the neural network in `p1_conf`, `p2_conf`, and `p3_conf`.

2. There are four different columns for the *dog's status*: `doggo, floofer, pupper, puppo`.

*Others*

3. The `image prediction data` and the `tweets api data` DataFrames can be merged into one.
4. Multiple URLs in the `expanded_urls` column of the `df_twitter_archive` dataframe.
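
To make the first two tidiness issues concrete, here is a rough sketch of what the tidy form could look like. It assumes a stage column holds the stage name when the stage applies and `None` (or a missing value) otherwise. The helper names `combine_stages` and `top_prediction` are hypothetical, and the second image row is made up to show a non-dog case.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row, e.g. 'doggo-pupper'."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return "-".join(found) if found else None


def top_prediction(row):
    """Return the label and dog flag of the most confident prediction."""
    best = max((1, 2, 3), key=lambda i: row[f"p{i}_conf"])
    return pd.Series({"dog_breed": row[f"p{best}"], "is_dog": row[f"p{best}_dog"]})


# Made-up archive rows (assumption: 'None' marks an absent stage).
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
archive["dog_status"] = archive.apply(combine_stages, axis=1)
archive = archive.drop(columns=STAGES)

# First row is from the preview above; the second row is made up.
images = pd.DataFrame({
    "tweet_id": ["666020888022790149", "0"],
    "p1": ["Welsh_springer_spaniel", "hen"],
    "p1_conf": [0.465074, 0.80],
    "p1_dog": [True, False],
    "p2": ["collie", "golden_retriever"],
    "p2_conf": [0.156665, 0.10],
    "p2_dog": [True, True],
    "p3": ["Shetland_sheepdog", "snail"],
    "p3_conf": [0.061428, 0.05],
    "p3_dog": [True, False],
})
images = images.join(images.apply(top_prediction, axis=1))
dogs_only = images[images["is_dog"].astype(bool)]

print(archive)
print(dogs_only[["tweet_id", "dog_breed"]])
```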


## Cleaning

1. Retweeted data is included in the dataset. Not all the tweets are about dog ratings, and some of them are retweets. If the value in `retweeted_status_id` is not null, the tweet is a retweet and should be removed from the DataFrame. I filtered the data to keep only the records where `retweeted_status_id` is null. These records are not retweets.


2. There are missing values in the following columns: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `expanded_urls`, and `retweeted_status_timestamp`. These columns are not needed for the rest of the analysis, so I dropped them.


3. The `tweet_id` variable is of the type *integer* instead of *object* in both the `twitter archive data` and `image_prediction data`. I converted these variables to the `object` datatype. I won't be using them for any sort of calculation since they are primary keys. 


4. The `timestamp` variable is of the *object* datatype instead of *datetime*. I converted it to *datetime* format.


5. The values `a`, `an`, `the`, and `just` in the name column of the `twitter archive data` are not dog names. I replaced these values with `None`, since the names are unknown.

6. There is HTML formatting in the `source` column of the `twitter archive data`. I removed this formatting so that the actual source of each tweet could be seen.


7. There are four different columns for the dog's status: `doggo`, `floofer`, `pupper`, `puppo`. This is a structural issue. Also, some dogs are `doggo-pupper`, while others are `doggo-puppo`. I created a new column `dog_status` to show if the dog is doggo, floofer, pupper, puppo, doggo-pupper or doggo-puppo.


8. Not all animals in the `image prediction data` are dogs. I removed the records of animals that are not dogs, using the `p1_dog`, `p1_conf`, `p2_dog`, `p2_conf`, `p3_dog` and `p3_conf` variables. First, I found the prediction with the highest confidence score among `p1_conf`, `p2_conf`, and `p3_conf`. Then I checked the corresponding value of `p*_dog` to see whether it was `True` or `False`. I created a new column `is_dog` to record the result: `True` means the image is a dog, `False` means it is not. I then removed the records that were not dog images from the `image prediction data`.



9. There should be just one column that tells us the breed of dog in the image, based on the confidence scores of the neural network in `p1_conf`, `p2_conf`, and `p3_conf`. I created a new column `dog_breed`. It is based on which column has the highest confidence score among `p1_conf`, `p2_conf`, and `p3_conf`. The dog breed is the corresponding value of `p*`. For instance, if `p1_conf` has the maximum confidence score, then the dog breed is `p1`. I created a function to do this.



10. I removed unnecessary columns from the `image prediction data`: `p1_conf`, `p2_conf`, `p3_conf`, `p1`, `p2`, `p3`, and `is_dog`.


11. I fixed inconsistencies in the `dog_breed` column. I replaced the underscore character `_` with a space ` ` and converted all the text to lower case.



12. I merged the three cleaned DataFrames on `tweet_id` and saved the result in a file called "twitter_archive_master.csv".
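
Several of the cleaning steps above can be sketched in pandas as follows. This is a simplified illustration, not the exact notebook code. The DataFrames hold made-up rows, the `source` value is a placeholder anchor tag, the regular expression for stripping HTML is one possible choice, unknown names are stored as the string `None`, and the merge is assumed to be an inner join.

```python
import pandas as pd

# Made-up stand-in data with the same column names as the real files.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-08-01 00:17:27 +0000",
                  "2017-07-31 00:18:03 +0000"],
    "source": ['<a href="https://example.com/app">Example App</a>'] * 3,
    "retweeted_status_id": [float("nan"), 5.0, float("nan")],
    "name": ["Phineas", "Tilly", "a"],
})
images = pd.DataFrame({
    "tweet_id": [1, 3],
    "dog_breed": ["Welsh_springer_spaniel", "German_shepherd"],
})
api = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "favorite_count": [39467, 33819, 25461],
    "retweet_count": [8853, 6514, 4328],
})

# Step 1: keep only original tweets (retweets have a non-null retweeted_status_id).
archive = archive[archive["retweeted_status_id"].isnull()].copy()

# Step 2: drop columns that are mostly missing and no longer needed.
archive = archive.drop(columns=["retweeted_status_id"])

# Step 3: treat tweet_id as an identifier (text), not a number, in every table.
for frame in (archive, images, api):
    frame["tweet_id"] = frame["tweet_id"].astype(str)

# Step 4: parse the timestamp strings into datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Step 5: replace words that are not dog names with the placeholder 'None'.
archive["name"] = archive["name"].replace(["a", "an", "the", "just"], "None")

# Step 6: strip HTML tags so only the link text (the tweet source) remains.
archive["source"] = archive["source"].str.replace(r"<[^>]+>", "", regex=True)

# Step 11: make breed names consistent (spaces instead of underscores, lower case).
images["dog_breed"] = (
    images["dog_breed"].str.replace("_", " ", regex=False).str.lower()
)

# Step 12: merge the three tables on tweet_id (inner join assumed).
master = archive.merge(images, on="tweet_id").merge(api, on="tweet_id")
print(master)

# Saving with index=False avoids writing the row index as an extra column:
# master.to_csv("twitter_archive_master.csv", index=False)
```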