# Introduction to Pandas: Data Reading

In this blog, we'll explore the process of reading and working with data using Pandas. No prior experience is necessary – we'll start from the basics and guide you through every step with simple explanations and practical examples.

By the end, you'll be able to import, explore, and make sense of data from various sources. Let's dive in!

Topics
- A. Importing and Reading Data Files
    - A.1. Reading Data
    - A.2. Naming Columns
    - A.3. Handling Missing Values
    - A.4. Using Different Separators
- B. Exploring the Data
- C. Saving the Data

## A. Importing and Reading Data Files

Before we start, let's import the `pandas` module.

In [1]:
import pandas as pd

### A.1. Reading Data

Reading data is a fundamental step in data analysis, and Pandas simplifies this process by offering functions for importing data from various sources, including CSV, Excel, and databases. 

For this tutorial, I'm going to use a CSV file containing the following data.

```csv
John Doe,28,Male,45000,johndoe@example.com
Jane Smith,35,Female,60000,janesmith@example.com
Mike Johnson,22,Male,?,mikejohnson@example.com
Emily Davis,,Female,55000,emilydavis@example.com
Chris Brown,40,,75000,chrisbrown@example.com
Anna Lee,27,Female,62000,annalee@example.com
David Clark,32,Male,68000,davidclark@example.com
Sophia Kim,29,Female,51000,sophiakim@example.com
Kevin Wilson,38,Male,_,kevinwilson@example.com
Linda Johnson,45,Female,72000,lindajohnson@example.com
Robert Smith,33,Male,58000,robertsmith@example.com
Mary Wilson,,Female,62000,marywilson@example.com
William Davis,28,Male,48000,williamdavis@example.com
Jennifer Lee,29,Female,53000,jenniferlee@example.com
Michael Brown,36,Male,70000,michaelbrown@example.com
Patricia Miller,31,Female,56000,patriciamiller@example.com
James Taylor,42,Male,80000,jamestaylor@example.com
Karen Anderson,34,Female,--,karenanderson@example.com
Joseph Martinez,26,Male,49000,josephmartinez@example.com
```

As mentioned earlier, Pandas has built-in functions for reading different kinds of data files. Some of these are `read_csv`, `read_excel`, `read_html`, `read_json`, and `read_sql`.

In [20]:
# filename
csv_url = "data.csv"

# reading csv files
df = pd.read_csv(csv_url)

df

|   | John Doe | 28 | Male | 45000 | johndoe@example.com |
|---|---|---|---|---|---|
| 0 | Jane Smith | 35.0 | Female | 60000 | janesmith@example.com |
| 1 | Mike Johnson | 22.0 | Male | ? | mikejohnson@example.com |
| 2 | Emily Davis | NaN | Female | 55000 | emilydavis@example.com |
| 3 | Chris Brown | 40.0 | NaN | 75000 | chrisbrown@example.com |
| 4 | Anna Lee | 27.0 | Female | 62000 | annalee@example.com |
| 5 | David Clark | 32.0 | Male | 68000 | davidclark@example.com |
| 6 | Sophia Kim | 29.0 | Female | 51000 | sophiakim@example.com |
| 7 | Kevin Wilson | 38.0 | Male | _ | kevinwilson@example.com |
| 8 | Linda Johnson | 45.0 | Female | 72000 | lindajohnson@example.com |
| 9 | Robert Smith | 33.0 | Male | 58000 | robertsmith@example.com |


By calling the `df` variable, we can see the data loaded from our source file.
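If you don't have `data.csv` on disk, the same behavior can be reproduced from a string. This is a minimal sketch, assuming the first three rows of the sample data; `sample_text` and `sample_df` are illustrative names, not part of the original notebook. It shows that `read_csv` accepts a file-like object and that, by default, the first row is consumed as the header.

```python
import io

import pandas as pd

# First three rows of the sample data, held in memory instead of a file.
sample_text = (
    "John Doe,28,Male,45000,johndoe@example.com\n"
    "Jane Smith,35,Female,60000,janesmith@example.com\n"
    "Mike Johnson,22,Male,?,mikejohnson@example.com\n"
)

# read_csv accepts a file-like object as well as a path.
sample_df = pd.read_csv(io.StringIO(sample_text))

# The first data row was used as the header, so only two rows remain.
print(sample_df.columns.tolist())
print(len(sample_df))
```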

### A.2. Naming Columns 
As you can see, `read_csv` treats the first row of the file as the column header by default. Our CSV file has no header row, so the first data record was used as the header instead. To prevent this, set the `header` argument to `None`.

In [21]:
df = pd.read_csv(csv_url, header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | John Doe | 28.0 | Male | 45000 | johndoe@example.com |
| 1 | Jane Smith | 35.0 | Female | 60000 | janesmith@example.com |
| 2 | Mike Johnson | 22.0 | Male | ? | mikejohnson@example.com |
| 3 | Emily Davis | NaN | Female | 55000 | emilydavis@example.com |
| 4 | Chris Brown | 40.0 | NaN | 75000 | chrisbrown@example.com |


Setting `header` to `None` keeps the first row as data and labels the columns with integers instead. To give each column a proper name, use the `names` argument.

In [22]:
df = pd.read_csv(csv_url, header=None, names=["Name", "Age", "Sex", "Income", "Email"])
df.head()

|   | Name | Age | Sex | Income | Email |
|---|---|---|---|---|---|
| 0 | John Doe | 28.0 | Male | 45000 | johndoe@example.com |
| 1 | Jane Smith | 35.0 | Female | 60000 | janesmith@example.com |
| 2 | Mike Johnson | 22.0 | Male | ? | mikejohnson@example.com |
| 3 | Emily Davis | NaN | Female | 55000 | emilydavis@example.com |
| 4 | Chris Brown | 40.0 | NaN | 75000 | chrisbrown@example.com |
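Column names can also be assigned after the data is loaded. This is a minimal sketch, assuming a two-row in-memory sample; `raw_df` and `renamed_df` are illustrative names. It uses `DataFrame.rename` to map the integer labels produced by `header=None` to readable names.

```python
import io

import pandas as pd

sample_text = (
    "John Doe,28,Male,45000,johndoe@example.com\n"
    "Jane Smith,35,Female,60000,janesmith@example.com\n"
)

# With header=None the columns are labeled 0, 1, 2, 3, 4.
raw_df = pd.read_csv(io.StringIO(sample_text), header=None)

# rename returns a new DataFrame; raw_df keeps its integer labels.
renamed_df = raw_df.rename(
    columns={0: "Name", 1: "Age", 2: "Sex", 3: "Income", 4: "Email"}
)

print(renamed_df.columns.tolist())
```

Passing `names` to `read_csv` is usually shorter; renaming afterwards is handy when the DataFrame already exists.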


### A.3. Handling Missing Values
If we look back at our table, especially the "Income" column, we'll notice that missing values are marked with different symbols.

In [39]:
df["Income"]

0     45000
1     60000
2         ?
3     55000
4     75000
5     62000
6     68000
7     51000
8         _
9     72000
10    58000
11    62000
12    48000
13    53000
14    70000
15    56000
16    80000
17       --
18    49000
Name: Income, dtype: object

You can convert these symbols to `NaN` by listing them in the `na_values` argument.

In [43]:
df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"] # replaces the following character with NaN
)
df["Income"]

0     45000.0
1     60000.0
2         NaN
3     55000.0
4     75000.0
5     62000.0
6     68000.0
7     51000.0
8         NaN
9     72000.0
10    58000.0
11    62000.0
12    48000.0
13    53000.0
14    70000.0
15    56000.0
16    80000.0
17        NaN
18    49000.0
Name: Income, dtype: float64
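Once the placeholder symbols are read as `NaN`, the missing values can be counted per column. This is a minimal sketch, assuming five rows taken from the sample data; `clean_df` and `missing_per_column` are illustrative names. It combines `isna()` with `sum()`.

```python
import io

import pandas as pd

sample_text = (
    "John Doe,28,Male,45000,johndoe@example.com\n"
    "Mike Johnson,22,Male,?,mikejohnson@example.com\n"
    "Emily Davis,,Female,55000,emilydavis@example.com\n"
    "Kevin Wilson,38,Male,_,kevinwilson@example.com\n"
    "Karen Anderson,34,Female,--,karenanderson@example.com\n"
)

clean_df = pd.read_csv(
    io.StringIO(sample_text),
    header=None,
    names=["Name", "Age", "Sex", "Income", "Email"],
    na_values=["?", "_", "--"],  # treat these symbols as NaN
)

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = clean_df.isna().sum()
print(missing_per_column)
```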

### A.4. Using Different Separators
If you ever run into a file that uses a different separator (say, `-` instead of `,`), you can use the `sep` or `delimiter` argument.

```python
df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"],  # treat these symbols as NaN
    sep="-" # changes separator
)
```

`sep` and `delimiter` work the same way:

```python
df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"],  # treat these symbols as NaN
    delimiter="-" # changes separator
)
```

`delimiter` is simply an alias for `sep`, so the two arguments are interchangeable. Either one accepts a single character or a regular expression. Separators longer than one character (other than `\s+`) are interpreted as regular expressions and require the slower Python parsing engine.

Here's an example of `sep` using a regular expression to split on either a comma or a semicolon:

```python
# a regex separator needs the Python parsing engine
df = pd.read_csv('data.csv', sep='[,;]', engine='python')
```
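To see a regular-expression separator at work without a file, here is a minimal sketch. It assumes a made-up two-line sample that mixes commas and semicolons; `mixed_text` and `mixed_df` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical sample: the two rows mix "," and ";" as separators.
mixed_text = "John Doe,28;Male\nJane Smith;35,Female\n"

mixed_df = pd.read_csv(
    io.StringIO(mixed_text),
    sep="[,;]",       # split on either a comma or a semicolon
    engine="python",  # regex separators are handled by the Python engine
    header=None,
    names=["Name", "Age", "Sex"],
)

print(mixed_df)
```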

## B. Exploring the Data

Pandas also has built-in methods for viewing the basic information about our data.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    19 non-null     object 
 1   Age     17 non-null     float64
 2   Sex     18 non-null     object 
 3   Income  19 non-null     object 
 4   Email   19 non-null     object 
dtypes: float64(1), object(4)
memory usage: 520.0+ bytes


The `info()` method, as its name suggests, prints basic information about the data table, such as each column's name, non-null count, and *Dtype* (data type).

If you want to access the first or last few rows of the table, you can use `head()` and `tail()`.

In [24]:
# head()
df.head()

|   | Name | Age | Sex | Income | Email |
|---|---|---|---|---|---|
| 0 | John Doe | 28.0 | Male | 45000 | johndoe@example.com |
| 1 | Jane Smith | 35.0 | Female | 60000 | janesmith@example.com |
| 2 | Mike Johnson | 22.0 | Male | ? | mikejohnson@example.com |
| 3 | Emily Davis | NaN | Female | 55000 | emilydavis@example.com |
| 4 | Chris Brown | 40.0 | NaN | 75000 | chrisbrown@example.com |


In [25]:
# tail()
df.tail()

|   | Name | Age | Sex | Income | Email |
|---|---|---|---|---|---|
| 14 | Michael Brown | 36.0 | Male | 70000 | michaelbrown@example.com |
| 15 | Patricia Miller | 31.0 | Female | 56000 | patriciamiller@example.com |
| 16 | James Taylor | 42.0 | Male | 80000 | jamestaylor@example.com |
| 17 | Karen Anderson | 34.0 | Female | -- | karenanderson@example.com |
| 18 | Joseph Martinez | 26.0 | Male | 49000 | josephmartinez@example.com |


`head()` returns the first rows of the data table, while `tail()` returns the last rows. Both return 5 rows by default.

You can pass a number to either method to specify how many rows to return.

In [26]:
df.head(3)

|   | Name | Age | Sex | Income | Email |
|---|---|---|---|---|---|
| 0 | John Doe | 28.0 | Male | 45000 | johndoe@example.com |
| 1 | Jane Smith | 35.0 | Female | 60000 | janesmith@example.com |
| 2 | Mike Johnson | 22.0 | Male | ? | mikejohnson@example.com |


In [27]:
df.tail(7)

|   | Name | Age | Sex | Income | Email |
|---|---|---|---|---|---|
| 12 | William Davis | 28.0 | Male | 48000 | williamdavis@example.com |
| 13 | Jennifer Lee | 29.0 | Female | 53000 | jenniferlee@example.com |
| 14 | Michael Brown | 36.0 | Male | 70000 | michaelbrown@example.com |
| 15 | Patricia Miller | 31.0 | Female | 56000 | patriciamiller@example.com |
| 16 | James Taylor | 42.0 | Male | 80000 | jamestaylor@example.com |
| 17 | Karen Anderson | 34.0 | Female | -- | karenanderson@example.com |
| 18 | Joseph Martinez | 26.0 | Male | 49000 | josephmartinez@example.com |


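Here are some more methods and attributes that return basic information:

`df.describe()` - Generates summary statistics for numeric columns. Only "Age" appears in the output here because "Income" is still stored as text (`object`) at this point.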

In [36]:
df.describe()

|   | Age |
|---|---|
| count | 17.0 |
| mean | 32.647059 |
| std | 6.143504 |
| min | 22.0 |
| 25% | 28.0 |
| 50% | 32.0 |
| 75% | 36.0 |
| max | 45.0 |
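By default, `describe()` leaves text columns out of the summary. This is a minimal sketch, assuming a three-row in-memory sample; `people_df` and `summary` are illustrative names. Passing `include="all"` adds `count`, `unique`, `top`, and `freq` rows for the non-numeric columns.

```python
import io

import pandas as pd

sample_text = (
    "John Doe,28,Male\n"
    "Jane Smith,35,Female\n"
    "Anna Lee,27,Female\n"
)

people_df = pd.read_csv(
    io.StringIO(sample_text),
    header=None,
    names=["Name", "Age", "Sex"],
)

# include="all" summarizes numeric and non-numeric columns together.
summary = people_df.describe(include="all")

# Rows that apply to text columns: count, unique, top (most common), freq.
print(summary.loc[["count", "unique", "top", "freq"], "Sex"])
```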


`df.shape` - Get the dimensions of the DataFrame (rows, columns).

In [37]:
df.shape

(19, 5)

`df['column_name']` - Access a specific column.

In [38]:
df["Email"]

0            johndoe@example.com
1          janesmith@example.com
2        mikejohnson@example.com
3         emilydavis@example.com
4         chrisbrown@example.com
5            annalee@example.com
6         davidclark@example.com
7          sophiakim@example.com
8        kevinwilson@example.com
9       lindajohnson@example.com
10       robertsmith@example.com
11        marywilson@example.com
12      williamdavis@example.com
13       jenniferlee@example.com
14      michaelbrown@example.com
15    patriciamiller@example.com
16       jamestaylor@example.com
17     karenanderson@example.com
18    josephmartinez@example.com
Name: Email, dtype: object
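The same bracket syntax also accepts a list of column names. This is a minimal sketch, assuming a small hand-built DataFrame; `people_df`, `email_series`, and `contact_df` are illustrative names. It shows that a single name returns a `Series`, while a list of names returns a `DataFrame`.

```python
import pandas as pd

people_df = pd.DataFrame(
    {
        "Name": ["John Doe", "Jane Smith"],
        "Age": [28, 35],
        "Email": ["johndoe@example.com", "janesmith@example.com"],
    }
)

# A single column name returns a Series.
email_series = people_df["Email"]

# A list of column names returns a DataFrame with only those columns.
contact_df = people_df[["Name", "Email"]]

print(type(email_series).__name__)
print(contact_df.columns.tolist())
```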

## C. Saving the Data
After processing your data, you can save it back to a file or a database using Pandas. For example, to save a DataFrame to a CSV file:
```python
df.to_csv('new_data.csv', index=False)
```
The `index=False` argument prevents Pandas from writing the row indices to the CSV file.
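To see what `index=False` changes, here is a minimal sketch, assuming a small hand-built DataFrame; `small_df`, `with_index`, and `without_index` are illustrative names. When `to_csv` is called without a path, it returns the CSV text as a string, which makes the two outputs easy to compare.

```python
import pandas as pd

small_df = pd.DataFrame({"Name": ["John Doe", "Jane Smith"], "Age": [28, 35]})

# Called without a path, to_csv returns the CSV text instead of writing a file.
with_index = small_df.to_csv()
without_index = small_df.to_csv(index=False)

# By default the row index becomes an extra, unnamed first column.
print(with_index.splitlines()[0])     # header line with a leading comma
print(without_index.splitlines()[0])  # header line with only the column names
```

Leaving the index out avoids an extra `Unnamed: 0` column when the file is read back with `read_csv`.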

Pandas provides a rich set of functions for data manipulation and analysis, making it a powerful tool for working with tabular data in Python.

In conclusion, data reading with Pandas is an essential and powerful skill for data analysts and scientists. This Python library simplifies the process of importing data from various sources, such as CSV files, Excel spreadsheets, and databases, and provides a user-friendly interface for data exploration. By mastering Pandas' data reading capabilities, professionals can efficiently load, manipulate, and analyze datasets, making it a cornerstone of effective data analysis workflows.