## Reporting: wrangle_report

# Report for Udacity Project Wrangle and Analyze Data


## WeRateDogs Twitter Archive Analysis
 
This project involves wrangling data from the WeRateDogs Twitter archive. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about each dog. The data came from three sources, each in a different file format:

### Enhanced Twitter Archive
This dataset is a CSV file that WeRateDogs sent to Udacity for use in this project. It contains tweet data such as tweet ID, timestamp, text, source, name, etc. The tweets in this dataset were created on or before August 1st, 2017.

### Twitter API
This dataset contains the retweet_count and favorite_count of each tweet and was obtained by querying the Twitter API with the Python Tweepy library. The tweet IDs from the Twitter archive dataset were used to retrieve the corresponding tweet data in JSON format from the Twitter API.

### Image Predictions File
The image predictions file is in TSV format. For each tweet in the Twitter archive dataset, it contains the tweet ID, the image URL, and the top three predictions from a neural network that classifies dog breeds.

Wrangling the datasets involved three steps:
- Gathering
- Assessing
- Cleaning

### Gathering
- **Enhanced Twitter Archive**: This dataset was downloaded manually through this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv). After downloading, the dataset was read into a pandas dataframe using the read_csv function.
```Python
import pandas as pd

# Loading the twitter dataset in a dataframe
df_tweet = pd.read_csv('twitter-archive-enhanced.csv')
```
- **Image Prediction file**: The image prediction file, which is in TSV format, was downloaded programmatically from this [url](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv) into a new folder using Python's requests library.
```Python
import os
import requests

# url is the link above; folder_name is the target folder and must already exist
response = requests.get(url)
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
```
After the file was downloaded and saved in the folder, it was read into a pandas dataframe:
```Python
# Loading the downloaded image file into a dataframe
df_image_prediction = pd.read_csv('my_project/image-predictions.tsv', sep='\t')
```
- **Twitter API**: The Twitter API is not openly accessible; Twitter requires an official application before granting elevated access. After being granted elevated access, I was given an api_key, api_secret_key, access_token and access_secret. With these credentials (the first two are passed as consumer_key and consumer_secret below), I was able to connect to the Twitter API and make calls on it:
```Python
# Connecting to the Twitter API
import tweepy
from tweepy import OAuthHandler

# consumer_key, consumer_secret, access_token and access_secret hold the private credentials
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
```
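
The download step itself is not shown in this report. A minimal sketch of how it could look, assuming Tweepy's `get_status` call with `tweet_mode='extended'` and the `_json` attribute of the returned status. The function name `save_tweets` and the broad `except` are illustrative choices, not the code used in the project:

```python
import json


def save_tweets(api, tweet_ids, path='tweet_json.txt'):
    """Write one JSON object per line for every tweet that can be fetched.

    `api` is assumed to be a tweepy.API instance (any object exposing
    get_status works). Returns the IDs that could not be retrieved.
    """
    failed = []
    with open(path, 'w', encoding='utf-8') as outfile:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode='extended')
            except Exception:  # e.g. deleted or protected tweets
                failed.append(tweet_id)
                continue
            # _json holds the raw dictionary returned by the API
            json.dump(status._json, outfile)
            outfile.write('\n')
    return failed
```

Returning the failed IDs makes it easy to see how many archived tweets are no longer available, which matters later when the datasets are merged.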
After connecting to the API, the tweet data was downloaded and saved in a tweet_json.txt file. The dataset was then loaded into a dataframe:
```Python
# Loading the tweet dataset into a dataframe
df_tweet_api = pd.DataFrame(df_tweet_list, columns=['tweet_id', 'retweet_count', 'favorite_count'])
```
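
How `df_tweet_list` was built is not shown here. One way to produce it, assuming `tweet_json.txt` holds one JSON object per line with the `id`, `retweet_count`, and `favorite_count` fields (the helper name `read_tweet_counts` is hypothetical):

```python
import json


def read_tweet_counts(path='tweet_json.txt'):
    """Return a list of dicts with the three fields kept in this report."""
    rows = []
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return rows
```

The returned list can be passed to `pd.DataFrame` with the same `columns` argument as in the call above.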

### Assessing
After assessing the three datasets visually and programmatically, I found 8 quality issues and 3 tidiness issues:

#### Quality issues
- Four quality issues were found in the Twitter archive dataset
- Four quality issues were also found in the Image prediction dataset

#### Tidiness Issues
- Two tidiness issues were found in the Twitter archive dataset
- One tidiness issue involves combining the three datasets to form one master dataset
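
The specific checks are not listed in this report. As a rough sketch of what programmatic assessment can look like in pandas, the helper below (hypothetical, not taken from the project notebook) summarizes missing values, duplicate rows, and column datatypes for any dataframe:

```python
import pandas as pd


def assess(df):
    """Collect basic quality indicators for a dataframe."""
    return {
        'rows': len(df),
        'nulls_per_column': df.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        'dtypes': df.dtypes.astype(str).to_dict(),
    }


# Toy data for illustration only; not taken from the WeRateDogs archive
sample = pd.DataFrame({
    'tweet_id': [1, 2, 2],
    'name': [None, 'Bo', 'Bo'],
})
summary = assess(sample)
```

Visual assessment (scrolling through the data in a spreadsheet or notebook) complements checks like these, since issues such as implausible dog names do not show up as nulls or duplicates.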


### Cleaning
Before cleaning, copies of the three datasets were made. The cleaning was performed on the copies, not the originals. The following steps were taken to clean the datasets:
- Removing rows with null values
- Removing rows that are confirmed to be retweets or replies
- Dropping duplicate rows
- Dropping irrelevant columns
- Converting columns with an erroneous datatype to the correct datatype
- Joining two or more columns that describe one variable into a single column
- Merging the datasets together to form a complete observational unit
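
Several of the steps above can be sketched on toy data. The column names `retweeted_status_id`, `in_reply_to_status_id`, and `timestamp` are assumptions about the archive's schema, and all values are made up; this is an illustration, not the project's cleaning code:

```python
import pandas as pd

# Toy stand-ins for the archive and API dataframes (values are made up)
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 3],
    'timestamp': ['2017-07-01 10:00:00', '2017-07-02 11:00:00',
                  '2017-07-03 12:00:00', '2017-07-03 12:00:00'],
    'retweeted_status_id': [None, 99.0, None, None],
    'in_reply_to_status_id': [None, None, None, None],
})
counts = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweet_count': [10, 20, 30],
    'favorite_count': [100, 200, 300],
})

# Work on a copy so the original stays untouched
clean = archive.copy()

# Keep only original tweets: retweets and replies have these IDs filled in
is_original = (clean['retweeted_status_id'].isnull()
               & clean['in_reply_to_status_id'].isnull())
clean = clean[is_original]

# Drop the now-empty helper columns, then any duplicate rows
clean = clean.drop(columns=['retweeted_status_id', 'in_reply_to_status_id'])
clean = clean.drop_duplicates()

# Convert the timestamp from string to datetime
clean = clean.assign(timestamp=pd.to_datetime(clean['timestamp']))

# Merge with the API counts to form one observational unit per tweet
master = clean.merge(counts, on='tweet_id', how='inner')
```

An inner merge keeps only tweets present in both dataframes, which is one reasonable choice when some archived tweets can no longer be fetched from the API.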

### Storing

After the wrangling process was complete, the cleaned datasets were merged and stored in a master dataset file called twitter_archive_master.csv.
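
A minimal sketch of the storing step, assuming the merged dataframe is called `master` (a placeholder name, filled here with made-up values):

```python
import pandas as pd

# Placeholder for the merged dataframe produced by the cleaning step
master = pd.DataFrame({
    'tweet_id': [1, 3],
    'retweet_count': [10, 30],
    'favorite_count': [100, 300],
})

# index=False keeps the row index out of the file
master.to_csv('twitter_archive_master.csv', index=False)

# Reading the file back is a quick check that nothing was lost
restored = pd.read_csv('twitter_archive_master.csv')
```

Note that CSV does not preserve datatypes such as datetime, so columns converted during cleaning need to be parsed again when the file is reloaded.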