# Exploring and Inspecting Data

### Lesson Overview
Now that we have a better grasp of Jupyter Notebooks, we can begin digging into datasets to investigate what intrigues us. When inspecting a new dataset, it is typically recommended to start by focusing on a few important topics. We also need to understand how to read and write our data to move forward with the data analysis process. With these foundational skills in place, we can gradually develop our intuition about how to explore our data.

In this lesson, you will be:

- Forming and asking questions with data
- Defining data wrangling and EDA
- Gathering data
- Reading CSV files with pandas
- Using pandas to inspect and assess data

## Asking Questions

https://www.youtube.com/watch?v=0LexLA1Hres

The first step of the data analysis process is asking questions. Sometimes we ask questions first and get our data later; other times we get the data first and ask questions based on it. Here, we will practice asking questions with a real dataset.

### Dataset information
For more information from the dataset source, visit UCI's ML repository. We will dive into additional details about this dataset on the next page.

### pd.read_csv

As shown in the video, you can use df = pd.read_csv('some_csv_file.csv') (where pd is the shorthand for pandas), with the name of your own file, to read a CSV file into a pandas DataFrame. We'll come back to CSV files shortly.
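As a minimal sketch of that call, the snippet below first writes a tiny CSV file to a temporary folder so that it can run anywhere. The column names and values are made up for illustration; in the workspace you would pass the name of a real file instead.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to a temporary folder so the example is self-contained.
# The contents are invented for illustration only.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "some_csv_file.csv")
with open(csv_path, "w") as f:
    f.write("name,score\nalice,90\nbob,85\n")

# Read the file into a DataFrame, then peek at the first rows.
df = pd.read_csv(csv_path)
print(df.head())
```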

## Questions for a Dataset

Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI Machine Learning Repository
(The dataset is included in the workspace here for you as "cancer_data.csv". If you're interested, you can explore it further on Kaggle or UCI's ML repository.)

Attribute Information:

- ID number
- Diagnosis (M = malignant, B = benign)
- 30 features

The following ten features are computed for each cell nucleus. For each of these ten features, a column is created for the mean, standard error, and max value.

| Feature | Description |
| --- | --- |
| Radius | Mean of distances from center to points on the perimeter |
| Texture | Standard deviation of gray-scale values |
| Perimeter | |
| Area | |
| Smoothness | Local variation in radius lengths |
| Compactness | Perimeter² / Area - 1.0 |
| Concavity | Severity of concave portions of the contour |
| Concave Points | Number of concave portions of the contour |
| Symmetry | |
| Fractal Dimension | "Coastline approximation" - 1 |

Let's use pandas to take a look at the data! Run the cells in the Jupyter Notebook below. What are good questions you can ask based on this information?

Work through the workspace notebook below to prepare for answering the questions.
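If you want a feel for the kind of inspection the notebook walks through, here is a rough sketch on a toy DataFrame. The column names (id, diagnosis, radius_mean) and every value are assumptions made up for this example; they are not taken from cancer_data.csv, so check the real column names in the workspace.

```python
import pandas as pd

# Toy rows with invented values; the real dataset has many more rows and columns.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "diagnosis": ["M", "B", "B", "M"],
        "radius_mean": [17.9, 12.4, 11.8, 20.1],
    }
)

print(df.shape)                        # (number of rows, number of columns)
print(df.head())                       # first five rows (all four here)
df.info()                              # column names, non-null counts, dtypes
print(df["diagnosis"].value_counts())  # how many rows per diagnosis

# One question you might ask: does the average radius differ by diagnosis?
radius_by_diagnosis = df.groupby("diagnosis")["radius_mean"].mean()
print(radius_by_diagnosis)
```

Outputs like these often suggest the questions themselves, such as whether a feature tends to be larger for one diagnosis than the other.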

## Data Wrangling and EDA

https://www.youtube.com/watch?v=EQXfxbUup0o

### Defining terms

Wrangling and EDA (exploratory data analysis) are sometimes used synonymously because their purposes often overlap. In this course, however, we will define the terms as follows:

- EDA: exploring and augmenting data to maximize the potential of analysis, visualizations, and models; for example, engineering new features and removing outliers
- Wrangling: gathering, assessing, and cleaning data

Examples of data wrangling and EDA are common in industry. For example, you may need to clean large amounts of data before entering it into a database because some fields collected on a web page are empty. Or you may need to augment the data by normalizing it and correcting spelling errors that a user entered into a field by mistake.

Wrangling and EDA are important not just for preparing for data analysis but afterward as well. Visualizations, for example, help communicate any findings you have, which we will detail later in this course.


## Gathering Data

https://www.youtube.com/watch?v=4bwJp623Foc

Data acquisition can happen in a number of ways:

- Downloading files that are readily available
- Getting data from an API or web scraping
- Pulling data from existing databases

There may also be a need to combine data from multiple different formats.
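As one possible illustration of combining formats, the sketch below reads the same kind of records from a CSV source and a JSON source and stacks them into one DataFrame. Both sources, their column names, and their values are hypothetical, and in-memory text stands in for real files or API responses.

```python
import io

import pandas as pd

# Two hypothetical sources holding the same kind of records in different formats.
csv_source = io.StringIO("city,population\nAvon,1200\nBrook,3400\n")
json_source = io.StringIO('[{"city": "Cedar", "population": 560}]')

from_csv = pd.read_csv(csv_source)
from_json = pd.read_json(json_source)

# Stack the rows into one DataFrame with a fresh 0..n-1 index.
combined = pd.concat([from_csv, from_json], ignore_index=True)
print(combined)
```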

### Defining terms

- CSV (comma-separated values): a text file with a tabular structure that holds only raw data. It is easy to process with code, for example in Python.

An example of what a CSV file would look like if you opened it in a text editor:

header_name1,header_name2,header_name3
one,14,town
five,42,city
ten,30,town
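To see how pandas interprets that text, the sketch below parses the same four lines. It uses an in-memory string in place of a file on disk, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# The example CSV from above, held in a string instead of a file.
csv_text = """header_name1,header_name2,header_name3
one,14,town
five,42,city
ten,30,town
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # column labels are taken from the first line
print(df.dtypes)            # the middle column is inferred as integers
```

The first line becomes the column labels, and each remaining line becomes one row of the DataFrame.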

## Reading CSV Files

Continuing with the practice of gathering data for analysis, we're introducing several new datasets for you to read into DataFrames. You've seen .read_csv() before, but now let's understand its functionality better.

The .read_csv() function reads a CSV file and returns its contents as a pandas DataFrame. This is incredibly useful, as you can leverage all the power of pandas on the data from the CSV. For comparison, Excel is limited in the number of rows it can handle (1,048,576 per worksheet in current versions), which makes analyzing large datasets difficult.

Five major parameters of .read_csv() are the focus of this page:

- filepath_or_buffer = path of the CSV file being read (usually passed as the first argument)
- sep = separator between values, , by default
- header = row number to use for the column names; pass None if the file has no header row
- index_col = column(s) to use as the index; if omitted, pandas creates a numeric index (0, 1, 2, ...)
- names = custom labels for each column
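The sketch below shows how these parameters might fit together for a file that does not follow the defaults. The semicolon-separated data and the labels code, count, and kind are hypothetical, chosen only to exercise each parameter.

```python
import io

import pandas as pd

# Hypothetical data: semicolon-separated, with no header row.
raw = "a1;14;town\nb2;42;city\nc3;30;town\n"

df = pd.read_csv(
    io.StringIO(raw),                 # first argument: a path or file-like object
    sep=";",                          # the delimiter used in this data
    header=None,                      # the data has no header row
    names=["code", "count", "kind"],  # custom column labels
    index_col="code",                 # use the "code" column as the index
)
print(df)
```

Leaving out index_col would keep code as an ordinary column and give the rows a default numeric index instead.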

CSV files are not always structured the same way, so understanding the flexibility of pandas' .read_csv() function is critical. You'll also be learning how to manage DataFrame headers and indices, as well as how to output modified DataFrames into new CSV files.
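As a preview of the output step, here is a small sketch that modifies a DataFrame and writes it back out as CSV text with DataFrame.to_csv(). The data and column names are invented, and an in-memory buffer is used where you would normally pass a filename.

```python
import io

import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({"city": ["Avon", "Brook"], "population": [1200, 3400]})
df["population_thousands"] = df["population"] / 1000  # a small modification

# index=False leaves the 0, 1, ... row labels out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the text back gives the same columns.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.columns.tolist())
```

Without index=False, the row labels are written as an extra unnamed first column, which then shows up as a column when the file is read back.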