# Data Wrangling Report


## Table of Contents
<ul>
<li><a href="#obj">Project objectives</a></li>
<li><a href="#gathering">Gathering Data</a></li>
<li><a href="#acd">Assessing and Cleaning Data</a></li>
    <ol>
        <li><a href="#quality">Quality Issues</a></li>
        <li><a href="#tidy">Tidiness Issues</a></li>
    </ol>
<li><a href="#result">Result</a></li>
</ul>

<a id='obj'></a>
<font color=#0877cc size=4><b>Project objectives</b></font>

The main objectives of the project were:
- Perform data wrangling (gathering, assessing, and cleaning) on the provided data sources.
- Store, analyze, and visualize the wrangled data.
- Report on:
<ol>
    <li>Data wrangling efforts.</li>
    <li>Data analysis and visualizations.</li> 
</ol>

<a id='gathering'></a>
<font color=#0877cc size=4><b>Step 1: Gathering Data</b></font>

In this step, three pieces of data were gathered and loaded as pandas dataframes:
- The WeRateDogs Twitter archive file ('twitter_archive_enhanced.csv'), provided by the Udacity Classroom.
- The tweet image predictions ('image-predictions.tsv'). This file was downloaded programmatically using the Requests library from a provided <a href="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv">URL</a>.
- Each tweet's entire set of JSON data, in a file called 'tweet_json.txt'. This file was also obtained from the Udacity Classroom because of an issue with getting a Twitter Developer account. Each tweet's JSON data is written on its own line.
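
A minimal sketch of how the three sources could be loaded. The function names `read_tweet_json` and `load_sources` are hypothetical helpers introduced for illustration, and the notebook's actual loading code may differ. The URL is the one given above.

```python
import io
import json

import pandas as pd

PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def read_tweet_json(lines):
    """Parse one JSON object per line into a dataframe, skipping blank lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.DataFrame(records)


def load_sources(archive_path, json_path, predictions_url=PREDICTIONS_URL):
    """Load the archive CSV, the downloaded predictions TSV, and the tweet JSON."""
    import requests  # imported here so read_tweet_json works without it

    twitter_arch = pd.read_csv(archive_path)

    response = requests.get(predictions_url, timeout=30)
    response.raise_for_status()  # fail early on a bad download
    img_pred = pd.read_csv(io.StringIO(response.text), sep="\t")

    with open(json_path, encoding="utf-8") as handle:
        twitter_json = read_tweet_json(handle)

    return twitter_arch, img_pred, twitter_json
```

Under these assumptions, a call such as `load_sources('twitter_archive_enhanced.csv', 'tweet_json.txt')` would return the three dataframes.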

<a id='acd'></a>
<font color=#0877cc size=4><b>Steps 2 and 3: Assessing and Cleaning Data</b></font>

While working with the data, a number of observations were made. The following sections list the data quality and data tidiness issues observed, along with the solutions applied.

<a id='quality'></a>
<font color=#0877cc size=4><b>Quality Issues</b></font>

### **twitter_arch dataframe**

<ol>
    <ul><font color=#0877cc size=4>Observation One</font>
         <li>Columns <b>doggo</b>, <b>floofer</b>, <b>pupper</b> and <b>puppo</b> have <b>None</b> for missing values.</li>
    </ul>
     <ul><font color=#0877cc size=4>Solution</font>
         <li>Replaced <b>None</b> with <b>np.nan</b> for <b>doggo</b>, <b>floofer</b>, <b>pupper</b> and <b>puppo</b> columns.</li>
    </ul>
    <ul><font color=#0877cc size=4>Observation Two</font>
        <li>The <b>source</b> column contains an HTML <b>&lt;a&gt;</b> tag that wraps the tweet source; the source can be extracted and converted to the <b>categorical</b> datatype.</li>
    </ul>
    <ul><font color=#0877cc size=4>Solution</font>
       <li>Extracted the tweet source from the <b>source</b> column using the pandas <b>apply</b> method, then converted it to the <b>categorical</b> datatype.
         </li>
    </ul>
     <ul><font color=#0877cc size=4>Observation Three</font>
         <li>The <b>text</b> column ends with the tweet link and the rating, which can be removed.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Extracted the rating scores from the tweet <b>text</b> using <b>RegEx</b> and converted them to <b>float</b>.</li>
    </ul>
      <ul><font color=#0877cc size=4>Observation Four</font>
         <li>The <b>timestamp</b> column is <b>str</b> instead of <b>datetime</b>.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Converted the <b>timestamp</b> column to <b>datetime</b>.</li>
       </ul>
      <ul><font color=#0877cc size=4>Observation Five</font>
         <li>The <b>rating_numerator</b> column should be of type <b>float</b>, and its values should be extracted correctly.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
        <li>Extracted the rating score correctly and converted it to <b>float</b>.</li>
      </ul>
      <ul><font color=#0877cc size=4>Observation Six</font>
         <li>The <b>rating_denominator</b> column has values below 10, as well as values above 10 for ratings that cover more than one dog.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Removed values other than 10 for <b>rating_denominator</b>.</li>
       </ul>
      <ul><font color=#0877cc size=4>Observation Seven</font>
         <li>The <b>expanded_urls</b> column has <b>NaN</b> values.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Dropped rows with <b>NaN</b> values in the <b>expanded_urls</b> column.</li>
      </ul>
     <ul><font color=#0877cc size=4>Observation Eight</font>
        <li>The <b>name</b> column has <b>None</b> instead of <b>NaN</b> and many invalid values.</li>
     </ul>
     <ul><font color=#0877cc size=4>Solution</font>
         <li>Replaced <b>'None'</b> with <b>np.nan</b> in the <b>name</b> column of <b>twitter_arch_copy</b> and removed any rows with invalid names.</li>
    </ul>
</ol>
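
Several of the fixes listed above can be sketched in one pass. This is an illustration under stated assumptions, not the notebook's exact code: `clean_archive`, `RATING_PATTERN`, and the sample rows are hypothetical, and the rating pattern is only one plausible way to capture decimal numerators. Dropping rows with missing `expanded_urls` and removing invalid names are left out because the report does not state the exact criteria.

```python
import re

import numpy as np
import pandas as pd

# Assumed pattern: "<numerator>/<denominator>", where the numerator may be decimal.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"


def clean_archive(twitter_arch):
    """Apply a subset of the quality fixes to a copy of the archive dataframe."""
    df = twitter_arch.copy()

    # The string "None" marks missing values; swap it for a real NaN.
    for column in ["doggo", "floofer", "pupper", "puppo", "name"]:
        df[column] = df[column].mask(df[column] == "None", np.nan)

    # Keep only the text wrapped by the <a> tag, stored as a categorical.
    df["source"] = (
        df["source"]
        .apply(lambda html: re.search(r">([^<]+)<", html).group(1))
        .astype("category")
    )

    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Re-extract the rating from the text so decimal numerators are kept.
    ratings = df["text"].str.extract(RATING_PATTERN)
    df["rating_numerator"] = ratings[0].astype(float)
    df["rating_denominator"] = ratings[1].astype(float)

    # Keep only rows rated out of 10.
    return df[df["rating_denominator"] == 10].reset_index(drop=True)


# Made-up rows that mimic the archive layout.
sample = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "source": ['<a href="http://twitter.com/download/iphone" '
               'rel="nofollow">Twitter for iPhone</a>'] * 3,
    "text": ["This is Bella. 13.5/10 https://t.co/x",
             "Such a good dog. 13/10 https://t.co/y",
             "Here are seven pups. 84/70 https://t.co/z"],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None"] * 3,
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None"] * 3,
    "name": ["Bella", "None", "None"],
})
cleaned = clean_archive(sample)
```

In this sketch the third sample row is dropped because its denominator is 70, and the decimal numerator of the first row survives as a float.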

### **twitter_json**

   <ul><font color=#0877cc size=4>Observation Nine</font>
      <li>The <b>id</b> column in <b>twitter_json_copy</b> is named differently from the matching column in the other two datasets.</li>
   </ul>
   <ul><font color=#0877cc size=4>Solution</font>
      <li>Renamed <b>id</b> column in <b>twitter_json_copy</b> to <b>tweet_id</b>.</li>
   </ul>

<a id='tidy'></a>
<font color=#0877cc size=4><b>Tidiness Issues</b></font>

### **twitter_arch dataframe**

<ul><font color=#0877cc size=4>Observation</font>
         <li>The <b>doggo</b>, <b>floofer</b>, <b>pupper</b>, and <b>puppo</b> columns all describe the same variable: the dog stage.</li>
    </ul>
     <ul><font color=#0877cc size=4>Solution</font>
         <li>Created a single <b>dog_stage</b> column and removed the four original columns.</li>
    </ul>
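
One possible way to collapse the four stage columns into `dog_stage` is sketched below. The helper `add_dog_stage` is hypothetical, and joining multiple stages with a comma is an assumption, since the report does not say how tweets that mention more than one stage were handled.

```python
import numpy as np
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def add_dog_stage(df):
    """Return a copy with one dog_stage column instead of the four stage columns."""
    # Treat the string "None" as missing, whether or not it was already replaced.
    stages = df[STAGE_COLUMNS].where(df[STAGE_COLUMNS] != "None")

    # Join whatever stage names remain in each row, e.g. "doggo,pupper".
    combined = stages.apply(lambda row: ",".join(row.dropna()), axis=1)

    out = df.drop(columns=STAGE_COLUMNS)
    out["dog_stage"] = combined.replace("", np.nan)  # no stage mentioned -> NaN
    return out


# Made-up rows: one stage, no stage, and two stages.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = add_dog_stage(sample)
```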
 
### **img_pred**
    
 <ul><font color=#0877cc size=4>Observation</font>
        <li>The <b>img_num</b> column is not used in this project.</li>
    </ul>
    <ul><font color=#0877cc size=4>Solution</font>
       <li>Removed <b>img_num</b> column from <b>img_pred_copy</b> dataset.
         </li>
    </ul>
    
### **twitter_json**
  <ul><font color=#0877cc size=4>Observation</font>
    <li>Only three columns are needed: <b>id</b>, <b>retweet_count</b>, and <b>favorite_count</b>.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Removed the unnecessary columns from <b>twitter_json_copy</b>.</li>
    </ul>
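
The two column-trimming steps above, together with the `id` to `tweet_id` rename from the quality section, could look roughly like this. `trim_img_pred` and `trim_tweet_json` are hypothetical helper names, and the sample columns other than those named in the report are made up.

```python
import pandas as pd


def trim_img_pred(img_pred):
    """Return the image predictions without the img_num column."""
    return img_pred.drop(columns=["img_num"])


def trim_tweet_json(twitter_json):
    """Keep the three needed columns and align the key name with the other tables."""
    kept = twitter_json[["id", "retweet_count", "favorite_count"]]
    return kept.rename(columns={"id": "tweet_id"})


# Made-up rows; "lang" stands in for any column that is not needed.
json_sample = pd.DataFrame({
    "id": [1, 2],
    "retweet_count": [10, 20],
    "favorite_count": [30, 40],
    "lang": ["en", "en"],
})
pred_sample = pd.DataFrame({"tweet_id": [1, 2], "img_num": [1, 1]})

twitter_json_copy = trim_tweet_json(json_sample)
img_pred_copy = trim_img_pred(pred_sample)
```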

### **General**
  <ul><font color=#0877cc size=4>Observation</font>
         <li>All datasets should be combined into a single dataset.</li>
      </ul>
      <ul><font color=#0877cc size=4>Solution</font>
         <li>Merged the three cleaned datasets into one dataframe.</li>
    </ul>
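
A sketch of the merge, assuming all three cleaned dataframes share a `tweet_id` key and that only tweets present in every source are kept. The inner join is an assumption, since the report does not state the join type, and `merge_sources` and the sample columns are hypothetical.

```python
import pandas as pd


def merge_sources(twitter_arch, img_pred, twitter_json):
    """Join the three cleaned dataframes on tweet_id, keeping shared tweets only."""
    merged = twitter_arch.merge(img_pred, on="tweet_id", how="inner")
    return merged.merge(twitter_json, on="tweet_id", how="inner")


# Made-up rows: only tweet 2 appears in all three tables.
arch = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12.0, 13.0, 11.0]})
pred = pd.DataFrame({"tweet_id": [1, 2], "prediction": ["pug", "corgi"]})
counts = pd.DataFrame({"tweet_id": [2, 3], "retweet_count": [5, 7]})

master = merge_sources(arch, pred, counts)
```

With an inner join, rows that are missing from any one source drop out, so the merged frame has no gaps in the prediction or count columns.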

<a id='result'></a>
<font color=#0877cc size=4><b>Result</b></font>

The merged dataset was stored as `twitter_archive_master.csv` using the pandas `to_csv()` method, and further analysis and visualizations were carried out on it.