# Write a Data Science Blog Post @Udacity - Become a Data Scientist

by Armin Sedlmeyr, a Udacity student from Munich, Germany working @BMW Group

## Software used:
pandas, 
NumPy, 
requests, 
tweepy, 
json

## General Project Requirements:
> **Code Functionality and Readability**:
- [ ] Code has an easy-to-follow logical structure. The code uses comments effectively and/or Notebook Markdown cells correctly. The steps of the data science process (gather, assess, clean, analyze, model, visualize) are clearly identified with comments or Markdown cells as well. Variable and function names should follow the PEP 8 style guide.
- [ ] All the project code is contained in a Jupyter notebook, which demonstrates successful execution and output of the code.
- [ ] Code is well documented and uses functions and classes as necessary. All functions include document strings. DRY principles are implemented.

> **Data**:
- [ ] Project follows the CRISP-DM process outlined for questions through communication. This can be done in the README or the notebook. If a question does not require machine learning, descriptive or inferential statistics should be used to create a compelling answer to a particular question.
- [ ] Categorical variables are handled appropriately for machine learning models (if models are created). Missing values are also handled appropriately for both descriptive and ML techniques. Document why a particular approach was used, and why it was appropriate for a particular situation.

> **Analysis, Modeling, Visualization**:
- [ ] There are between 3 and 5 questions asked, related to the business or real-world context of the data. Each question is answered with an appropriate visualization, table, or statistic.

> **Github Repository**:
- [ ] Student must have a Github repository of their project. The repository must have a README.md file that communicates the libraries used, the motivation for the project, the files in the repository with a small description of each, a summary of the results of the analysis, and necessary acknowledgements. Students should not use another student's code to complete the project, but they may use other references on the web including StackOverflow and Kaggle to complete the project.

> **Blog Post**:
- [ ] Student must have a blog post on a platform of their own choice (can be on their website, a Medium post or Github blog post). Student must communicate their results clearly. The post should not dive into technical details or difficulties of the analysis - this should be saved for Github. The post should be understandable for non-technical people from many fields.
- [ ] Student must have a title and image to draw readers to their post.
- [ ] There are no long, ongoing blocks of text without line breaks or images for separation anywhere in the post.
- [ ] Each question is answered with a clear visual, table, or statistic that provides how the data supports or disagrees with some hypothesis that could be formed by each question of interest.


# Introduction

#### Data Set Information

**Airbnb Open Data.** Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, these datasets describe the listing activity of homestays in Seattle, WA, and Boston, MA. Each dataset contains three tables:

<ins>Listings</ins>, including full descriptions and average review score

<ins>Reviews</ins>, including unique id for each reviewer and detailed comments

<ins>Calendar</ins>, including listing id and the price and availability for that day

#### Data used
- kaggle-seattle-air-bnb [link](https://www.kaggle.com/airbnb/seattle)
- kaggle-boston-air-bnb [link](https://www.kaggle.com/airbnb/boston)

#### Table of Contents
<ul>    
<li><a href="#Data Wrangling">1 Data Wrangling</a></li> 
<li><a href="#Gather">1.1 Gather</a></li>
<li><a href="#Assess">1.2 Assess</a></li>
<li><a href="#Clean">1.3 Clean</a></li>
<li><a href="#Analyse Data">2 Analyse Data</a></li>
<li><a href="#Explore">2.1 Explore - Descriptive Statistics</a></li>
<li><a href="#Draw Conclusions">2.2 Draw Conclusions - Inferential Statistics</a></li>
<li><a href="#Communicate the results">2.3 Communicate the results</a></li>  
</ul>

<a id='Data Wrangling'></a>
## 1. Data Wrangling
<a id='Gather'></a>
## 1.1 Gather
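
A minimal sketch of the gather step. It assumes that the Kaggle downloads were unpacked into one folder per city and that the three tables are stored as `listings.csv`, `reviews.csv`, and `calendar.csv`. The folder layout and the helper name `load_airbnb_tables` are assumptions made for illustration, not part of the original project.

```python
from pathlib import Path

import pandas as pd

# Assumed file names of the three tables described in the introduction.
TABLE_NAMES = ("listings", "reviews", "calendar")


def load_airbnb_tables(data_dir):
    """Read the listings, reviews and calendar CSV files of one city.

    data_dir is a hypothetical folder such as 'data/seattle' that holds
    listings.csv, reviews.csv and calendar.csv.
    Returns a dict that maps each table name to a DataFrame.
    """
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in TABLE_NAMES}
```

With that layout, `load_airbnb_tables("data/seattle")["calendar"]` would return the Seattle calendar table. Keeping both cities behind the same function avoids repeating the `read_csv` calls (DRY).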


<a id='Assess'></a>
## 1.2 Assess
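
A minimal sketch of a programmatic assessment pass, using only standard pandas calls. The helper names `summarize_columns` and `shared_column_names` and the toy tables are hypothetical; the real tables would come from the gather step.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, missing count, missing share, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean(),
        "n_unique": df.nunique(),
    })


def shared_column_names(*frames):
    """Return the column names that occur in more than one DataFrame."""
    all_columns = pd.Series([col for frame in frames for col in frame.columns])
    return sorted(all_columns[all_columns.duplicated()].unique())


# Toy data standing in for the real tables (values are made up).
listings = pd.DataFrame({"id": [1, 2, 3], "price": ["$85.00", None, None]})
calendar = pd.DataFrame({"listing_id": [1, 2], "price": ["$85.00", "$90.00"]})

print(summarize_columns(listings))
print("duplicated rows:", listings.duplicated().sum())
print("shared columns:", shared_column_names(listings, calendar))
```

The per-column summary speaks to completeness (missing values), while the shared column names hint at tidiness issues such as the same variable being stored in several tables.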

> **General Notes about this step of Data Wrangling**:
- **Types of assessment**:
    - <ins>Visual assessment:</ins> scrolling through the data in your preferred software application (Google Sheets, Excel, a text editor, etc.).
        - df.head()
        - df.tail()
        - df
        - df.sample()
    - <ins>Programmatic assessment</ins>: using code to view specific portions and summaries of the data (pandas' head, tail, and info methods, for example).
        - df.duplicated()
        - .head (DataFrame and Series)
        - .tail (DataFrame and Series)
        - .sample (DataFrame and Series)
        - .info (DataFrame only)
        - .describe (DataFrame and Series)
        - .value_counts (Series only)
        - all_columns = pd.Series(list(df1) + list(df2) + list(df3)) # collect the column names of all datasets (concatenate the lists; adding Series with + would join the names element-wise)
        - all_columns[all_columns.duplicated()] # column names that appear in more than one dataset
        - Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc).
- **Quality**
    - <ins>Dirty data</ins> = low quality data = content issues
    - <ins>Quality dimensions</ins>:
        - _Completeness_: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
        - _Validity_: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
        - _Accuracy_: inaccurate data is wrong data that is still valid. It adheres to the defined schema and is technically possible, but it is incorrect or improbable (e.g. a height that is far too low but still allowed by the schema, or a person named "Dsvid", which is possible but not probable). Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
        - _Consistency_: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. A standard format is desired in columns that represent the same data across tables and/or within tables (e.g. a country that is abbreviated in some rows and written out in other rows of the same table).
    - <ins>Sources of dirty data</ins>:
        - We're going to have user entry errors.
        - In some situations, we won't have any data coding standards, or where we do have standards, they'll be poorly applied, causing problems in the resulting data.
        - We might have to integrate data where different schemas have been used for the same type of item.
        - We'll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time, systems evolve. Needs change, and data changes.
        - Some of our data won't have the unique identifiers it should.
        - Other data will be lost in transformation from one format to another.
        - And then, of course, there's always programmer error.
        - And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, that is one that's not our fault.
-  **[Tidiness](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)**
    - Messy data = untidy data = structural issues
    - Tidy data requirement:
        - Each variable forms a column.
        - Each observation forms a row.
        - Each type of observational unit forms a table.
    - <ins>The five most common problems with messy datasets</ins>:
        - Column headers are values, not variable names.
        - Multiple variables are stored in one column.
        - Variables are stored in both rows and columns.
        - Multiple types of observational units are stored in the same table.
        - A single observational unit is stored in multiple tables.
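
As an illustration of the first problem (column headers are values, not variable names), here is a minimal sketch with a made-up table in which each date is its own column. `DataFrame.melt` is one common remedy in pandas; the column names and values are invented for this example.

```python
import pandas as pd

# Made-up, messy table: the date columns are values, not variable names.
messy = pd.DataFrame({
    "listing_id": [1, 2],
    "2016-01-04": [85.0, 120.0],
    "2016-01-05": [85.0, 110.0],
})

# melt turns the date headers into a 'date' column, so that each row
# holds one observation (one listing on one day).
tidy = messy.melt(id_vars="listing_id", var_name="date", value_name="price")
tidy["date"] = pd.to_datetime(tidy["date"])

print(tidy.sort_values(["listing_id", "date"]).reset_index(drop=True))
```

After melting, each variable (`listing_id`, `date`, `price`) forms a column and each listing-day forms a row, which matches the tidy data requirements listed above.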

<a id='Clean'></a>
## 1.3 Clean
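
A minimal sketch of typical cleaning operations for the calendar table. It assumes that prices are stored as strings such as `$1,250.00` and that availability is coded as `t`/`f`; both formats should be confirmed during assessment. The helper name `clean_price` and the toy rows are hypothetical.

```python
import pandas as pd


def clean_price(prices):
    """Convert price strings such as '$1,250.00' to floats; missing values become NaN."""
    stripped = prices.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# Toy rows standing in for calendar.csv (values are made up).
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04"],
    "available": ["t", "f", "t"],
    "price": ["$85.00", None, "$1,250.00"],
})

# Work on a copy so that the raw data stays untouched.
calendar_clean = calendar.copy()
calendar_clean["price"] = clean_price(calendar_clean["price"])
calendar_clean["available"] = calendar_clean["available"].map({"t": True, "f": False})
calendar_clean["date"] = pd.to_datetime(calendar_clean["date"])

print(calendar_clean.dtypes)
```

Missing prices are kept as `NaN` here rather than filled in. Whether to drop or impute them depends on the question being asked, and that decision should be documented next to the code.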

<a id='Analyse Data'></a>
## 2 Analyse Data
<a id='Explore'></a>
### 2.1 Explore - Descriptive Statistics



<a id='Draw Conclusions'></a>
### 2.2 Draw Conclusions - Inferential Statistics




<a id='Communicate the results'></a>
### 2.3 Communicate the results

