### Assess

- Assessing your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).

- Quality issues are issues with content, e.g. missing, duplicate, or incorrect data.

- Untidy or messy data has structural issues (in columns, rows, or tables) that can slow down cleaning and analysis.

- Issues can be assessed visually or programmatically.

- Dirty data is also known as low-quality data.
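
As a minimal sketch of programmatic assessment, the snippet below counts missing values and exact duplicate rows using only the standard library. The `patients` records and their column names are hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical patient records; None marks a missing value.
patients = [
    {"name": "Ann Lee", "zip_code": "92390", "height_in": 64},
    {"name": "Bo Chen", "zip_code": None, "height_in": 70},
    {"name": "Ann Lee", "zip_code": "92390", "height_in": 64},  # exact duplicate
]

# Count missing values per column.
missing = Counter(
    column
    for row in patients
    for column, value in row.items()
    if value is None
)

# Count rows that repeat an earlier row exactly.
seen = set()
duplicates = 0
for row in patients:
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicates += 1
    seen.add(key)

print(dict(missing))  # {'zip_code': 1}
print(duplicates)     # 1
```

Counts like these only detect issues; each one still needs to be documented before cleaning.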

#### Assessment steps
- Detect issue
- Document issue

- Visual assessment can be directed (systematically looking through each table of data in a Jupyter notebook or spreadsheet) or non-directed (scrolling through the data looking for interesting and relevant issues).

### Be sure to separate detected issues by Quality and Tidiness


#### Quality
- 'treatments' table: missing HbA1c changes
- 'patients' table: zip code is a float not a string
- 'patients' table: zip code has four digits sometimes
- 'patients' table: Tim Neudorf is 27 inches instead of 72 inches
- 'patients' table: full state names sometimes, abbreviations other times

### What is the Difference Between Assessing and Exploring?

#### Data wrangling is about:

- Gathering the right data
- Assessing the data's quality and structure
- Modifying the data to make it clean

>However, your assessments and modifications will not make your analysis, visualizations, or models better. They just make them work.

#### Exploratory Data Analysis (EDA) is about:

- Exploring the data with simple visualizations that summarize the data's main characteristics
- Augmenting the data, for example by removing outliers and engineering features

### Data Quality Dimensions

>Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

- Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
- Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.
>These are listed in decreasing order of severity, meaning that the dimension listed first, completeness, is the most important.

>Regarding the other data quality research mentioned in the video, the additional dimensions are specific cases of these four dimensions listed above. Example: currency, defined as follows: the degree to which data is current with the world that it models. Currency can measure how up-to-date data is. Currency is a specific case of accuracy in data in the sense that out-of-date data is (usually) valid but wrong. In other words, our definition of accuracy can include currency.
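
One way to keep the four dimensions in mind is to tag each detected issue with the dimension it belongs to. The sketch below does that for a few hypothetical records modeled on the kinds of issues listed earlier in these notes. The function name `assess`, the column names, and the 48 to 96 inch plausibility range are all assumptions made for illustration. Note that a range check is only a rough proxy for accuracy, since inaccurate data is by definition valid.

```python
import re

# Hypothetical records; column names are assumptions.
patients = [
    {"name": "A", "zip_code": "92390", "height_in": 64, "state": "CA", "hba1c_change": 0.4},
    {"name": "B", "zip_code": "1730", "height_in": 27, "state": "California", "hba1c_change": None},
    {"name": "C", "zip_code": "10004", "height_in": -5, "state": "NY", "hba1c_change": 0.9},
]


def assess(row):
    """Return a list of (dimension, description) pairs for one record."""
    issues = []
    # Completeness: is a value missing?
    if row["hba1c_change"] is None:
        issues.append(("completeness", "missing hba1c_change"))
    # Validity: does the value break a schema rule?
    if not re.fullmatch(r"\d{5}", row["zip_code"]):
        issues.append(("validity", "zip code is not five digits"))
    if row["height_in"] <= 0:
        issues.append(("validity", "height must be positive"))
    # Accuracy: valid but implausible (assumed 48-96 inch range).
    elif not 48 <= row["height_in"] <= 96:
        issues.append(("accuracy", "implausible height"))
    # Consistency: more than one format for the same thing.
    if len(row["state"]) != 2:
        issues.append(("consistency", "state is not a two-letter abbreviation"))
    return issues


report = {row["name"]: assess(row) for row in patients}
for name, issues in report.items():
    print(name, issues)
```

Record A comes back clean, B is flagged under all four dimensions, and C has a single validity issue.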


## Flat file structure
>Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.

>Advantages of flat files include:

- They're text files and therefore human readable
- Lightweight
- Simple to understand
- Software that can read/write text files is ubiquitous, like text editors
- Great for small datasets

>Disadvantages of flat files, in comparison to relational databases, for example, include:

- Lack of standards
- Data redundancy
- Sharing data can be cumbersome
- Not great for large datasets
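
To make the structure concrete, here is a small sketch that reads a tab-delimited flat file with Python's built-in `csv` module. The file contents and column names are hypothetical, and `io.StringIO` stands in for a real file on disk.

```python
import csv
import io

# Hypothetical flat file: one record per line, fields separated by tabs.
flat_file = io.StringIO(
    "patient_id\tname\tstate\n"
    "1\tAnn Lee\tNY\n"
    "2\tBo Chen\tCA\n"
)

# The delimiter argument tells the reader how fields are separated.
reader = csv.DictReader(flat_file, delimiter="\t")
records = list(reader)

print(reader.fieldnames)    # ['patient_id', 'name', 'state']
print(records[1]["state"])  # CA
```

Every field comes back as a string, so decisions about data types (for example, keeping zip codes as text) are left to the code that reads the file.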


Web scraping allows us to extract data from websites using code.

### How Does Web Scraping Work?
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, the text can be accessed using parsers. We'll be using a Python parser called Beautiful Soup.
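
To illustrate the idea that HTML is just text a parser can walk through, here is a minimal sketch using Python's built-in `html.parser` module rather than Beautiful Soup. The `TagCollector` class and the sample HTML string are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each opening tag and the text found inside <h1>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.heading = ""
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "h1":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data


html = "<html><body><h1>Example heading</h1><p>Hello</p></body></html>"
parser = TagCollector()
parser.feed(html)
parser.close()

print(parser.tags)     # ['html', 'body', 'h1', 'p']
print(parser.heading)  # Example heading
```

Beautiful Soup wraps this kind of low-level parsing in a friendlier interface, which is why the rest of these notes use it.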

### Accessing the HTML
#### Manual Access
The quick way to get HTML data is by saving the HTML file to your computer manually. You can do this by clicking Save in your browser.

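#### Programmatic Access
Programmatic access is preferred for scalability and reproducibility. Two options include:

- Downloading the HTML file programmatically and saving it to disk
- Working with the response content live in your computer's memory using the Beautiful Soup HTML parser
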
In [None]:
# Downloading HTML file programmatically. We'll explore this code in more detail later
import requests
url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
response = requests.get(url)

# Save HTML to file

with open("et_the_extraterrestrial.html", mode='wb') as file:
    file.write(response.content)

In [None]:
# Work with the response content live in your computer's memory using the Beautiful Soup HTML parser

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')

### Different Types of HTML Elements

Heading elements are used for section headings.
<h1>Highest Level Heading Content</h1>
<h2>Second Level Heading Content</h2>
<h3>Third Level Heading Content</h3>
<h4>Fourth Level Heading Content</h4>

Paragraph elements are used for standard blocks of text.
<p>This is just a block of text.</p>

Span elements are used to group text within another block of text, often for styling.
<p>This block of text has a <span>element</span> inside it.</p>

Image elements are used to embed images in a web page.
<img src="image-file.jpg" alt="text that describes the image" />
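
As a rough sketch, this is how a few of the elements above could be pulled out with Beautiful Soup. It assumes the `bs4` package is installed and uses Python's built-in `'html.parser'` backend instead of `'lxml'` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

html = """
<h1>Highest Level Heading Content</h1>
<p>This block of text has a <span>element</span> inside it.</p>
<img src="image-file.jpg" alt="text that describes the image" />
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text.
heading = soup.find("h1").get_text()
span = soup.find("span").get_text()

# Attributes such as alt are read with dictionary-style access.
alt_text = soup.find("img")["alt"]

print(heading)   # Highest Level Heading Content
print(span)      # element
print(alt_text)  # text that describes the image
```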

### Elements Can Go Inside Other Elements
We can create a tree structure in HTML by putting elements inside other elements. To do this we often use a <div> element as a container. <div> elements are used to group chunks of content together.

<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>                                                                                                                                            
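
The tree structure can be explored from code as well. The sketch below, again assuming `bs4` is installed and using the built-in `'html.parser'` backend, lists the direct children of the `<div>` and then walks back up from the paragraph to its parent.

```python
from bs4 import BeautifulSoup

html = "<div><h1>This is a heading.</h1><p>This is a paragraph.</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")

# find_all(True, recursive=False) returns only the tags directly inside the <div>.
children = [
    (child.name, child.get_text())
    for child in div.find_all(True, recursive=False)
]
print(children)  # [('h1', 'This is a heading.'), ('p', 'This is a paragraph.')]

# Every tag also knows its parent.
parent_name = soup.find("p").parent.name
print(parent_name)  # div
```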





In [None]:
# Import Beautiful Soup
from bs4 import BeautifulSoup

# Make the soup; filepath is assumed to point at the HTML file saved earlier
filepath = 'et_the_extraterrestrial.html'
with open(filepath) as file:
    soup = BeautifulSoup(file, 'lxml')

# Find and extract data using the find() method, passing a tag name.
# For example, this returns the title element of the webpage, not the title of the movie:
soup.find('title')
# <title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

# To get the movie title only, we'll need to do some string slicing.
# .contents returns a list of the tag's children. Because the title tag
# has only one child, the list is one item long, so we access it with index 0:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

# This gives:
# 'E.T. The Extra-Terrestrial\xa0(1982)'
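
The slicing step can be tried on its own, without any HTML file. This is a small sketch using the title text shown above. The variable names are made up, and the final `replace` call is an optional extra that swaps the non-breaking space (`\xa0`) seen in the output for a regular space.

```python
# Title text as it appears in the notes, including the non-breaking space.
page_title = "E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes"
suffix = " - Rotten Tomatoes"

# A negative slice end drops the last len(suffix) characters.
movie_title = page_title[:-len(suffix)]

# Optionally normalize the non-breaking space to a regular space.
clean_title = movie_title.replace("\xa0", " ")

print(repr(movie_title))  # 'E.T. The Extra-Terrestrial\xa0(1982)'
print(clean_title)        # E.T. The Extra-Terrestrial (1982)
```

On Python 3.9 or later, `page_title.removesuffix(suffix)` gives the same result and leaves the string unchanged when the suffix is absent, which is safer if some pages use a different title format.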




In [1]:
from bs4 import BeautifulSoup
import os

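In [4]:
df_list = []
folder = r"C:\Users\veikt\Documents\Victor_Obi\2022\Data_analysis_project\Udacity\Data_Wrangling\rt-html"
for movie_html in os.listdir(folder):
    # movie_html (underscore) is the loop variable; writing movie-html is parsed
    # as "movie minus html" and raises NameError: name 'movie' is not defined
    with open(os.path.join(folder, movie_html)) as file:
        # Name the parser explicitly, as in the earlier cells
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        print(title)
        break  # stop after the first file while testing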