Pandas is the most popular Python library for data analysis.

The first step when working with pandas is to import the library.



In [None]:
import pandas as pd

# **Creating data**
There are two core objects in pandas: the DataFrame and the Series.

## **DataFrame**
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row and a column.

For example, consider the following simple DataFrame:

In [None]:
pd.DataFrame({'Morning': [8, 11], 'Evening': [16, 18]})

Similarly, its entries can be strings:

In [None]:
pd.DataFrame({'michael': ['It was fun.', 'It was boring.'], 'James': ['Pretty good.', 'Bland.']})

We can also set labels for the rows of the DataFrame using the index parameter:

In [None]:
pd.DataFrame({'michael': ['It was fun.', 'It was boring.'], 'James': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
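The Series, the other core object mentioned at the start of this section, can be sketched in the same way. The values and labels below are made up for illustration; the point is only that a Series is a single column of values with an index, and that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series is a single column of values; these labels are illustrative only.
ratings = pd.Series([4, 5, 3],
                    index=['Product A', 'Product B', 'Product C'],
                    name='rating')

# Selecting one column of a DataFrame yields a Series.
times = pd.DataFrame({'Morning': [8, 11], 'Evening': [16, 18]})
morning = times['Morning']

print(type(ratings).__name__)
print(type(morning).__name__)
```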

# **Reading data files**
Being able to create a DataFrame by hand is useful. But most of the time we won't be creating our own data by hand; we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is a CSV file. When you open a CSV file, you get something that looks like this:

In [None]:
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11

So a CSV file is a table of values separated by commas, which gives it the name "Comma-Separated Values", or CSV.
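To see how comma-separated text maps onto a table, here is a minimal sketch that parses CSV text held in a string rather than in a file. The variable names are hypothetical, and the text is a cleaned-up version of the sample (no trailing commas) so that each row has exactly three fields.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header row, then three data rows.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
products = pd.read_csv(io.StringIO(csv_text))

print(products)
print(products.shape)
```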

Let's now see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. This goes thusly:

In [None]:
salary_record = pd.read_csv("/content/ds_salaries.csv")

To display the dataset, just type the name of the DataFrame. In a notebook, this shows the first 5 and the last 5 rows:

In [None]:
salary_record

We can use the shape attribute to check how large the resulting DataFrame is:

In [None]:
salary_record.shape

So our new DataFrame has 607 records split across 12 different columns. That's more than 7 thousand entries!
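The arithmetic behind that count can be checked directly: shape is a (rows, columns) tuple, and the size attribute is their product. A small sketch on a made-up table, not the salary dataset:

```python
import pandas as pd

# Illustrative table with 3 rows and 2 columns.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
total_entries = df.size     # number of individual entries

print(n_rows, n_cols, total_entries)
```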

We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:

In [None]:
salary_record.head()

Likewise, we can get the last five rows using the **tail()** command.
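Both methods also accept the number of rows to return. A short sketch, assuming a made-up ten-row table:

```python
import pandas as pd

# Illustrative table holding the numbers 0 through 9.
numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()    # default is 5 rows
last_three = numbers.tail(3)   # ask for 3 rows instead

print(first_five)
print(last_three)
```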

You might have noticed that two columns appear to act as the index: the one pandas generates by default and the first column of the file. We can tell pandas to use a column from the file as the index with the **index_col** parameter:

In [None]:
salary_record = pd.read_csv("/content/ds_salaries.csv", index_col=0)
salary_record.head()
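The effect of index_col is easier to see on a tiny example. The sketch below uses hypothetical CSV text whose first column has no header, which is how a saved index typically appears in a file, and reads it with and without index_col.

```python
import io
import pandas as pd

# Hypothetical CSV text: the first column has an empty header.
csv_text = ",job_title,salary\n0,Data Scientist,70000\n1,ML Engineer,85000\n"

# Without index_col, the unnamed column is kept as a regular data column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.columns))
print(list(with_index_col.columns))
```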

# **Indexing, Selecting & Assigning**
Selecting specific values of a pandas DataFrame is a step you'll need in almost every operation, so one of the first things you need to learn when working with data in Python is how to select the required data points quickly and effectively.

In [None]:
salary_record

## **Native accessors**
Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

In Python, we can access the property of an object by accessing it as an attribute. A **book** object, for example, might have a **title** property, which we can access by calling **book.title**. Columns in a pandas DataFrame work in much the same way.

Hence, to access the **job_title** column of **salary_record**, we can use:

In [None]:
salary_record.job_title

If we have a Python dictionary, we can access its values using the indexing **[ ]** operator. We can do the same with columns in a DataFrame:

In [None]:
salary_record['job_title']

If a column name contains special characters, such as spaces, we won't be able to access it through the **DataFrame.column** syntax. This is where the **DataFrame[ ]** syntax has an advantage over the previous one.
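A minimal sketch of that limitation, using made-up column names: one contains a space, and the other happens to share its name with an existing DataFrame method, which is another case where attribute access does not reach the column.

```python
import pandas as pd

# Illustrative columns: 'job title' has a space, 'count' clashes with a method.
df = pd.DataFrame({'job title': ['Data Scientist', 'ML Engineer'],
                   'count': [3, 1]})

titles = df['job title']   # df.job title would be a syntax error
counts = df['count']       # bracket access reaches the column

# Attribute access finds the DataFrame.count method, not the column.
print(callable(df.count))
```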

To access a specific entry, add another **[ ]** (with the index number of the row in it) next to the column name, as shown below:

In [None]:
salary_record['job_title'][0]

# **Indexing in pandas**
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a beginner, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.

### **Index-based selection**
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
salary_record.iloc[0]

Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

In [None]:
salary_record.iloc[:, 4]

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the **salary** column from just the first, second, and third row, we would do:

In [None]:
salary_record.iloc[:3, 4]

Or, to select just the second and third entries, we would do:

In [None]:
salary_record.iloc[1:3, 4]
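Besides slices, iloc also accepts a list of positions and negative positions that count from the end. A short sketch on a made-up four-row table (the values are not from the salary dataset):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    'job_title': ['Data Scientist', 'ML Engineer', 'Data Analyst', 'Data Engineer'],
    'salary': [70000, 85000, 50000, 65000],
})

picked = df.iloc[[0, 2], 1]   # rows at positions 0 and 2, column at position 1
last_row = df.iloc[-1]        # negative positions count from the end

print(picked)
print(last_row)
```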

### **Label-based selection**
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the first entry in **salary** column in **salary_record**, we would now do the following:

In [None]:
salary_record.loc[0, 'salary']
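Label-based selection is easiest to tell apart from position-based selection when the index is not numeric. The sketch below assumes a made-up table with string row labels and selects by row label and column name.

```python
import pandas as pd

# Illustrative data with string labels 'a', 'b', 'c' as the index.
df = pd.DataFrame(
    {'job_title': ['Data Scientist', 'ML Engineer', 'Data Analyst'],
     'salary': [70000, 85000, 50000]},
    index=['a', 'b', 'c'],
)

one_value = df.loc['b', 'salary']                     # single entry
subset = df.loc[['a', 'c'], ['job_title', 'salary']]  # lists of labels

print(one_value)
print(subset)
```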

### **Choosing between loc and iloc**
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
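The difference in slice endpoints can be verified on a small scale. This sketch uses a made-up table with the default index 0,...,10 and a second one with string labels; the names are illustrative.

```python
import pandas as pd

# Default integer index 0..10, so 11 rows in total.
df = pd.DataFrame({'value': range(11)})

by_position = df.iloc[0:10]   # end excluded: positions 0..9
by_label = df.loc[0:10]       # end included: labels 0..10

# Label slices on a string index include both endpoints as well.
fruit = pd.DataFrame({'price': [1.0, 0.5, 2.0]},
                     index=['Apples', 'Bananas', 'Cherries'])
labelled = fruit.loc['Apples':'Bananas']

print(len(by_position), len(by_label))
print(list(labelled.index))
```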

# **Conditional selection**
So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in data scientists with better-than-average salaries.

As a simpler first step, suppose we want all the employees living in the US. We can use the following:

In [None]:
salary_record.loc[salary_record.employee_residence =="US"]

Suppose we want to add another condition (people with executive-level experience). We can do that using the **&** symbol, as we've done below:

In [None]:
salary_record.loc[(salary_record.employee_residence =="US") & (salary_record.experience_level=="EX")]
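The same pattern covers the better-than-average question raised at the start of this section. The sketch below runs on a small made-up table rather than the real dataset, and also shows the **|** ("or") operator and the isin() method as alternatives to **&**.

```python
import pandas as pd

# Illustrative data only; the column names follow the ones used above.
df = pd.DataFrame({
    'job_title': ['Data Scientist', 'Data Scientist', 'ML Engineer', 'Data Scientist'],
    'employee_residence': ['US', 'DE', 'US', 'US'],
    'salary': [150000, 80000, 120000, 100000],
})

# Data scientists earning more than the average data scientist.
scientists = df.loc[df.job_title == 'Data Scientist']
above_avg = scientists.loc[scientists.salary > scientists.salary.mean()]

# "Or" conditions use |, and isin() tests membership in a list of values.
either = df.loc[(df.employee_residence == 'DE') | (df.salary > 130000)]
us_or_de = df.loc[df.employee_residence.isin(['US', 'DE'])]

print(above_avg)
```

Each condition needs its own parentheses because **&** and **|** bind more tightly than comparison operators in Python.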

### **Assigning data**
Going the other way, assigning data to a DataFrame is easy. For example, you can assign a constant value to an entire column:

In [None]:
salary_record['employee_residence'] = 'Pk'
salary_record.employee_residence
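Assignment is not limited to constants. As a sketch on made-up data, a column can also be filled from an iterable of values, and loc can restrict an assignment to the rows that match a condition. The row_number column is a hypothetical name introduced for this example.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'employee_residence': ['US', 'DE', 'US'],
                   'salary': [100, 80, 120]})

# Assign an iterable: one value per row, in order.
df['row_number'] = list(range(len(df)))

# Assign only where a condition holds, using loc with a row mask and a column.
df.loc[df.employee_residence == 'DE', 'salary'] = 90

print(df)
```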

# **Summaries**
We can get the summaries of our data in the following ways:

In [None]:
salary_record.salary.describe()

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [None]:
salary_record.job_title.describe()
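The type-awareness described above shows up in the labels of the summary itself. A small sketch on made-up data, comparing the fields reported for a numeric column and for a string column:

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'job_title': ['Data Scientist', 'ML Engineer', 'Data Scientist'],
                   'salary': [100, 80, 120]})

numeric_summary = df.salary.describe()   # count, mean, std, min, quartiles, max
text_summary = df.job_title.describe()   # count, unique, top, freq

print(list(numeric_summary.index))
print(list(text_summary.index))
```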

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.

For example, to see the mean of the salary column (i.e. the average salary across all records), we can use the mean() function:

In [None]:
salary_record.salary.mean()

To see a list of unique values we can use the unique() function:

In [None]:
salary_record.job_title.unique()

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

In [None]:
salary_record.job_title.value_counts()
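These summary functions combine naturally with conditional selection. The sketch below, on a small made-up table, computes the overall mean, the mean for one job title only, the unique titles, and their counts.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'job_title': ['Data Scientist', 'ML Engineer', 'Data Scientist'],
                   'salary': [100, 80, 120]})

overall_mean = df.salary.mean()

# Mean salary for a single job title: filter the rows first, then summarise.
ds_mean = df.loc[df.job_title == 'Data Scientist', 'salary'].mean()

titles = df.job_title.unique()        # unique values in order of appearance
counts = df.job_title.value_counts()  # most frequent value first

print(overall_mean, ds_mean)
print(counts)
```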