## WeRateDogs - Twitter Data

### I. Gather Data

Following the Udacity instructions, I:

- downloaded the file **twitter-archive-enhanced.csv**.
- created a Twitter developer account and tried to build a JSON file named **tweet_json.txt** with the tweepy API.
    - this did not work, so I downloaded the data manually.
- downloaded the image predictions file (TSV format).

With all the data, I created three dataframes:

- *archive_df*: the contents of "twitter-archive-enhanced.csv".

- *tweets_info_df*: per-tweet information such as tweet_id, number of retweets and number of favorites.

- *image_predictions_df*: the predictions made for the image in each tweet.
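
A minimal sketch of how the three dataframes could be loaded. It assumes that **tweet_json.txt** holds one tweet object per line with the standard Twitter API keys `id`, `retweet_count`, `favorite_count` and `user`; the helper name `load_tweets_info` is hypothetical, and the in-memory strings are made-up stand-ins for the real files.

```python
import io
import json

import pandas as pd


def load_tweets_info(lines):
    """Build tweets_info_df from an iterable of JSON strings (one tweet per line)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweets": tweet["retweet_count"],
            "favorites": tweet["favorite_count"],
            "followers": tweet["user"]["followers_count"],
            "friends": tweet["user"]["friends_count"],
        })
    columns = ["tweet_id", "retweets", "favorites", "followers", "friends"]
    return pd.DataFrame(rows, columns=columns)


# Tiny in-memory stand-ins for the real files (all values are made up).
archive_csv = io.StringIO("tweet_id,text\n1,This is Bella. 12/10\n2,Meet Rex. 11/10\n")
predictions_tsv = io.StringIO(
    "tweet_id\tjpg_url\timg_num\tp1\n1\thttps://example.com/a.jpg\t1\tpug\n"
)
tweet_json = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20, '
    '"user": {"followers_count": 100, "friends_count": 10}}\n'
)

archive_df = pd.read_csv(archive_csv)  # real input: twitter-archive-enhanced.csv
image_predictions_df = pd.read_csv(predictions_tsv, sep="\t")  # tab-separated file
tweets_info_df = load_tweets_info(tweet_json)  # real input: tweet_json.txt
```

With the real files, an open file handle can be passed in the same way, e.g. `with open("tweet_json.txt") as f: load_tweets_info(f)`.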


### II. Assessing the Data


I explored the data with standard pandas functions such as:
- `df.info()`
- `df.isnull().sum()`
- `df["column"].duplicated()`
- `df["column"].value_counts()`
- `df["column"].unique()`
- `df["column"].describe()`
- `df.shape`
- etc.
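
As a quick illustration, here is what these calls look like on a toy dataframe. The values are made up and only stand in for `archive_df`.

```python
import pandas as pd

# Toy stand-in for archive_df (values are made up).
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Bella", "a", None, None],
    "rating_numerator": [12, 11, 1776, 1776],
})

df.info()                                    # column dtypes and non-null counts
missing = df.isnull().sum()                  # missing values per column
dup_ids = df["tweet_id"].duplicated().sum()  # number of repeated tweet_id values
name_counts = df["name"].value_counts()      # frequency of each name (missing excluded)
unique_names = df["name"].unique()           # distinct values, including missing
stats = df["rating_numerator"].describe()    # count, mean, std, min, quartiles, max
n_rows, n_cols = df.shape                    # table dimensions
```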


`Enhanced Twitter Archive`

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which was used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those with ratings were kept (there are 2356).

`archive_df` columns and their description:

- **tweet_id**: the unique identifier of each tweet
- **in_reply_to_status_id**: if the tweet is a reply, the ID of the tweet it replies to
- **in_reply_to_user_id**: if the tweet is a reply, the ID of the user it replies to
- **timestamp**: date and time the tweet was created, in Excel-friendly format
- **source**: the source the tweet was posted from, given as a web link
- **text**: the text of the tweet
- **retweeted_status_id**: if the tweet is a retweet, the ID of the original tweet
- **retweeted_status_user_id**: if the tweet is a retweet, the ID of the user who posted the original tweet
- **retweeted_status_timestamp**: if the tweet is a retweet, the date and time the original tweet was created, in Excel-friendly format
- **expanded_urls**: expanded version of the URL entered by the user and displayed in Twitter. Note that the user-entered URL may itself be a shortened URL, e.g. from bit.ly.
- **rating_numerator**: the numerator of the rating given in the tweet
- **rating_denominator**: the denominator (reference value) of the rating given in the tweet
- **name**: the dog's name
- **doggo**, **floofer**, **pupper**, **puppo**: the stage of the dog

A description of the Twitter data columns can be found [here](https://sfm.readthedocs.io/en/1.4.3/data_dictionary.html).

#### `Quality - archive_df`

#### `archive_df` table

1. Missing values in columns:
    - in_reply_to_status_id
    - in_reply_to_user_id
    - retweeted_status_id
    - retweeted_status_user_id
    - retweeted_status_timestamp
    - expanded_urls
2. 'rating_numerator' has inconsistent values, e.g. a maximum of 1776; 28 values are greater than 14.
3. 'rating_denominator' has inconsistent values; the denominator should always be 10.
4. Tweet ID 835246439529840640 has a rating denominator of 0.
5. Implausible dog names were found:
    - "a"
    - "not"
    - "one"
    - "very"
    - "o"
    - "an"
    - "all"
    - "infuriating"
6. 'timestamp' should be a datetime object.
7. 'retweeted_status_timestamp' should be a datetime object.
8. The following columns should be integers or objects (strings), but definitely not floats:
    - in_reply_to_status_id
    - in_reply_to_user_id
    - retweeted_status_id
    - retweeted_status_user_id
9. For missing values, the columns 'doggo', 'floofer', 'pupper' and 'puppo' show the string "None" instead of NaN.
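
Several of the issues above can be checked programmatically. The sketch below uses a made-up stand-in for `archive_df`; the lowercase test for suspicious names is my own assumption (every odd name listed above happens to be lowercase), not a rule given in the source data.

```python
import pandas as pd

# Toy stand-in for archive_df (values are made up, column names follow the report).
archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [12, 1776, 9, 960],
    "rating_denominator": [10, 10, 11, 0],
    "name": ["Bella", "a", "very", "Rex"],
    "doggo": ["None", "doggo", "None", "None"],
})

# Ratings with a numerator above 14.
high_numerators = archive_df[archive_df["rating_numerator"] > 14]

# Denominators other than 10, including zero.
bad_denominators = archive_df[archive_df["rating_denominator"] != 10]

# Assumption: real dog names are capitalised, so all-lowercase "names" are suspect.
suspect_names = archive_df.loc[archive_df["name"].str.islower(), "name"].tolist()

# Count the literal string "None" used as a missing-value placeholder.
none_placeholders = (archive_df["doggo"] == "None").sum()
```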




#### `tweets_info_df`

`tweets_info_df` columns and their description:

- **tweet_id**: the unique identifier of each tweet
- **retweets**: the number of times the tweet was retweeted
- **favorites**: the number of times the tweet was favorited
- **followers**: the number of followers of the account
- **friends**: the number of friends of the account (accounts it follows)

#### `Quality - tweets_info_df` table

- Information is missing for 14 tweet IDs.

#### `tweets_info_df` table

- No further issues stood out.

#### Tidiness - `tweets_info_df`

- Retweets and favorites should be joined to the `archive_df` table, since all other tweet information is found in `archive_df`.
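
A minimal sketch of that join on toy data. The inner join is an assumption on my part (it drops tweets that have no entry in `tweets_info_df`); a left join would keep them with NaN counts instead.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
archive_df = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
tweets_info_df = pd.DataFrame(
    {"tweet_id": ["1", "2"], "retweets": [5, 7], "favorites": [20, 30]}
)

# Inner join: keep only tweets present in both tables.
merged_df = archive_df.merge(
    tweets_info_df[["tweet_id", "retweets", "favorites"]],
    on="tweet_id",
    how="inner",
)
```

Both `tweet_id` columns need the same dtype (here strings) for the keys to match.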

#### `Quality - image_predictions_df dataset:`

The images in the WeRateDogs Twitter archive were run through a neural network that classifies breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

`image_predictions_df` columns:
    
- **tweet_id**: tweet_id is the last part of the tweet URL after "status/"
- **jpg_url**: Image link or URL
- **img_num**: Image number
- **p1**: p1 is the algorithm's #1 prediction for the image in the tweet 
- **p1_conf**: p1_conf is how confident the algorithm is in its #1 prediction
- **p1_dog**: p1_dog is whether or not the #1 prediction is a breed of dog
- **p2**: p2 is the algorithm's #2 prediction for the image in the tweet
- **p2_conf**: p2_conf is how confident the algorithm is in its #2 prediction
- **p2_dog**: p2_dog is whether or not the #2 prediction is a breed of dog
- **p3**: p3 is the algorithm's #3 prediction for the image in the tweet
- **p3_conf**: p3_conf is how confident the algorithm is in its #3 prediction
- **p3_dog**: p3_dog is whether or not the #3 prediction is a breed of dog



### III. Cleaning

The following steps were taken to clean the data:


* Converted the datatype of "tweet_id" to string

* Created a single combined dataset by joining all the dataframes on tweet_id

* Combined the dog stage columns into one column instead of multiple columns. This created duplicated rows, which were removed

* Converted the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to strings

* Converted retweeted_status_timestamp into a datetime object

* Changed unusual dog names ('infuriating', 'just', 'life', 'light', 'mad', 'my') to "No_name"

* retweeted_status_timestamp had null values, which were dropped
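
The sketch below illustrates several of these steps on a made-up stand-in for the combined table. Using `melt` for the dog stages is an assumption about how the duplicated rows arose, and this simplified version keeps only one stage for a tweet that lists two. The datetime conversion is shown on `timestamp`; the same call applies to `retweeted_status_timestamp`.

```python
import pandas as pd

# Toy stand-in for the combined table (values are made up).
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "in_reply_to_status_id": [float("nan"), 1.0, float("nan")],
    "name": ["Bella", "my", "Rex"],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# IDs are labels, not quantities, so store them as strings.
df["tweet_id"] = df["tweet_id"].astype(str)
df["in_reply_to_status_id"] = df["in_reply_to_status_id"].map(
    lambda v: str(int(v)) if pd.notna(v) else None
)

# Parse the timestamp text into datetime values.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Replace implausible names with the placeholder used in the report.
bad_names = ["infuriating", "just", "life", "light", "mad", "my"]
df["name"] = df["name"].replace(bad_names, "No_name")

# Collapse the four stage columns into one; melt yields four rows per tweet.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
id_cols = [c for c in df.columns if c not in stage_cols]
long_df = df.melt(id_vars=id_cols, value_vars=stage_cols, value_name="dog_stage")
long_df = long_df.drop(columns="variable")

# Turn the "None" placeholder into a real missing value.
long_df["dog_stage"] = long_df["dog_stage"].mask(long_df["dog_stage"] == "None")

# Put real stages first, then keep a single row per tweet.
long_df = long_df.sort_values("dog_stage", na_position="last")
clean_df = (
    long_df.drop_duplicates(subset="tweet_id", keep="first")
    .sort_values("tweet_id")
    .reset_index(drop=True)
)
```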
    






### IV. Store

I stored the final dataframe in a CSV file named **twitter_archive_master.csv**; the final data has 2055 rows and 28 columns.
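
A minimal sketch of the storing step on toy data. The temporary directory and `index=False` are my own choices to keep the example self-contained; only the file name comes from the project.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final dataframe (values are made up).
final_df = pd.DataFrame({"tweet_id": ["1", "2"], "retweets": [5, 7]})

# Write into a temporary directory so the sketch does not touch the project folder.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "twitter_archive_master.csv")

final_df.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Read back with tweet_id as a string so the IDs are not re-parsed as integers.
check_df = pd.read_csv(out_path, dtype={"tweet_id": str})
```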