# 1.1 Introduction to `pandas`

## What is `pandas`?

With Python, we can use "packages," which are toolkits of code that people release to accomplish certain tasks. For example, here we will use `pandas`, which is a package designed specifically for data analysis and manipulation.

In order to use `pandas` in our code, we need to use the `import` keyword:

In [1]:
import pandas as pd

## Using DataFrames in `pandas`
To analyze and manipulate data with `pandas`, we store tabular data in a format called a DataFrame. This is similar to how data is laid out in spreadsheet software like Microsoft Excel or Google Sheets.

For example, we can create a DataFrame to store information about animal species and the number of legs they have. Here we call this DataFrame `df`, which is a common shorthand:

In [2]:
df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

In [3]:
df

| | species | class | number_legs |
|---|---|---|---|
| 0 | cat | mammal | 4 |
| 1 | dog | mammal | 4 |
| 2 | ant | bug | 6 |
| 3 | spider | bug | 8 |
| 4 | lizard | reptile | 4 |
| 5 | snake | reptile | 0 |
| 6 | pig | mammal | 4 |
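
As a side note, the same table could also be built column by column instead of row by row. The following is a minimal sketch of that alternative, assuming a dictionary that maps each column name to a list of values; the name `df_from_dict` is only for illustration:

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
df_from_dict = pd.DataFrame({
    "species": ["cat", "dog", "ant", "spider", "lizard", "snake", "pig"],
    "class": ["mammal", "mammal", "bug", "bug", "reptile", "reptile", "mammal"],
    "number_legs": [4, 4, 6, 8, 4, 0, 4],
})

print(df_from_dict.columns.tolist())  # ['species', 'class', 'number_legs']
print(len(df_from_dict))              # 7
```

Either style should give the same DataFrame; which one reads better usually depends on whether your data arrives as rows or as columns.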


From here, we can do various operations, such as sorting the rows by a specific column:

In [4]:
df.sort_values(by="number_legs")

| | species | class | number_legs |
|---|---|---|---|
| 5 | snake | reptile | 0 |
| 0 | cat | mammal | 4 |
| 1 | dog | mammal | 4 |
| 4 | lizard | reptile | 4 |
| 6 | pig | mammal | 4 |
| 2 | ant | bug | 6 |
| 3 | spider | bug | 8 |
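
Sorting can go a little further than this. As a rough sketch (the variable name `sorted_df` is just an assumption for this example), `sort_values()` also accepts a list of columns and an `ascending` setting, and it returns a new DataFrame rather than changing `df` itself:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

# Most legs first; rows with the same leg count are ordered by species name.
sorted_df = df.sort_values(by=["number_legs", "species"], ascending=[False, True])

print(sorted_df["species"].tolist())
# ['spider', 'ant', 'cat', 'dog', 'lizard', 'pig', 'snake']

# The original DataFrame keeps its original row order.
print(df["species"].tolist()[0])  # cat
```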


Or counting how many rows we have of a given `class` value:

In [5]:
df.value_counts("class")

class
mammal     3
bug        2
reptile    2
dtype: int64

Or a given `number_legs` value:

In [6]:
df.value_counts("number_legs")

number_legs
4    4
0    1
6    1
8    1
dtype: int64

In the two examples above, we can see that "mammal" is the most common class in our DataFrame, and 4 is the most common number of legs.
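
Instead of reading the most common value off the output by eye, we could ask `pandas` for it directly. This is only a sketch of one way to do it: it selects a single column (a Series), counts its values, and uses `idxmax()` to get the label with the highest count. The names `class_counts` and `class_shares` are made up for this example:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

# Count the values in one column, then pick the label with the largest count.
class_counts = df["class"].value_counts()
print(class_counts.idxmax())                       # mammal
print(df["number_legs"].value_counts().idxmax())   # 4

# normalize=True reports each value's share of the rows instead of a raw count.
class_shares = df["class"].value_counts(normalize=True)
print(round(class_shares["mammal"], 2))            # 0.43
```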

We can build on these basic operations as we work with even larger datasets.

## Working with data files

### Loading data from a file
Oftentimes we will want to analyze data that is already stored in another file. We'll go over how to do this with an example.

In this example, we read a CSV file downloaded from [Iran Open Data](https://iranopendata.org/en/dataset/109-projects-extracted-official-website-khatam-alanbiya-2020). This dataset lists the 2020 contracts held by Khatam al-Anbiya Construction Headquarters. Because the file is in CSV format, we can use the `read_csv()` function in `pandas` to load it into a DataFrame. Again, we can call this DataFrame anything, but to make it easy to remember what it represents, I'll call it `contracts`.

Note that we pass in the relative path to the file we want. Since we are in the `english/1_Data-analysis/` subfolder, we have to use `../../` to move up out of these two subfolders (one `../` for each subfolder) in order to reach the `input/` folder, where our CSV is stored.

In [7]:
contracts = pd.read_csv("../../input/kaches204-109-projects-extracted-official-website-khatam-alanbiya-2020-en.csv")
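
To see how the `../../` part of that path works, here is a small sketch that resolves the relative path by hand. It assumes the notebook sits in `english/1_Data-analysis` relative to the project root, as described above, and uses Python's standard `posixpath` module so the result looks the same on any operating system; the variable names are just for illustration:

```python
import posixpath

# Folder the notebook lives in, relative to the project root (from the text above).
notebook_dir = "english/1_Data-analysis"
relative_path = "../../input/kaches204-109-projects-extracted-official-website-khatam-alanbiya-2020-en.csv"

# Join the two pieces, then collapse each "../" against the folder before it.
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))

print(resolved)
# input/kaches204-109-projects-extracted-official-website-khatam-alanbiya-2020-en.csv
```

Each `../` cancels out one folder, so two of them take us from `english/1_Data-analysis` back to the project root before we step into `input/`.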

Now we can check to make sure our DataFrame is loaded properly. For longer DataFrames like this one, `pandas` will show the first and last 5 rows by default:

In [8]:
contracts

| | area | classification | project status | project name | description |
|---|---|---|---|---|---|
| 0 | oil, gas, petrochemicals | field development | under construction | construction of a new gas injection station in... | |
| 1 | oil, gas, petrochemicals | field development | finished | final design of 3d seismic data harvesting ope... | |
| 2 | oil, gas, petrochemicals | field development | finished | seismic operations in azadegan north oil field | |
| 3 | oil, gas, petrochemicals | field development | finished | timab-changoleh three-dimensional seismic oper... | - |
| 4 | oil, gas, petrochemicals | refinery industries | under construction | construction and civil operations of iran lng ... | - |
| ... | ... | ... | ... | ... | ... |
| 105 | civil engineering and industry | communication and information technology | in progress | telecommunications and the sixth gas transmiss... | - |
| 106 | civil engineering and industry | special structures | special structures | nature bridge | - |
| 107 | civil engineering and industry | special structures | special structures | class bridge and sadr tunnel | - |
| 108 | civil engineering and industry | special structures | special structures | persian gulf lake crown buildings | - |


To preview only the top of the DataFrame, we can use the `head()` method, which shows the first 5 rows by default:

In [9]:
contracts.head()

| | area | classification | project status | project name | description |
|---|---|---|---|---|---|
| 0 | oil, gas, petrochemicals | field development | under construction | construction of a new gas injection station in... | |
| 1 | oil, gas, petrochemicals | field development | finished | final design of 3d seismic data harvesting ope... | |
| 2 | oil, gas, petrochemicals | field development | finished | seismic operations in azadegan north oil field | |
| 3 | oil, gas, petrochemicals | field development | finished | timab-changoleh three-dimensional seismic oper... | - |
| 4 | oil, gas, petrochemicals | refinery industries | under construction | construction and civil operations of iran lng ... | - |
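
`head()` also accepts a number of rows, and `tail()` does the same for the end of the table. The sketch below reuses the small animal DataFrame from earlier, rather than `contracts`, so that it runs without the CSV file:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

# First 2 rows of the DataFrame.
print(df.head(2)["species"].tolist())  # ['cat', 'dog']

# Last 3 rows of the DataFrame.
print(df.tail(3)["species"].tolist())  # ['lizard', 'snake', 'pig']
```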


We can also use the `shape` property of the DataFrame to see the dimensions of our data. In this case we have 110 rows and 5 columns:

In [10]:
contracts.shape

(110, 5)
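
Since `shape` is a plain tuple of (rows, columns), we can unpack it into two variables. Here is a short sketch, again using the small animal DataFrame so it runs on its own; `n_rows` and `n_cols` are names chosen for this example:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

# shape is (number of rows, number of columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 7 3

# len() gives just the number of rows.
print(len(df))         # 7
```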

### Saving data to a file
After we do our data analysis and processing, we may want to store our processed data in another file to reference later. For example, if we wanted to save our earlier DataFrame `df` to a CSV file, we could use `to_csv()`:

In [11]:
df.to_csv("../../output/1.1_data-output-example.csv")

Note that if this command didn't work for you, you might need to create an `output` folder in the root directory of this project first, since `to_csv()` does not create missing folders!
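
If you would rather create the folder from code, one possible approach is sketched below. It is an illustration rather than part of the project setup: a temporary directory stands in for the project root so the example can run anywhere, and the names `output_dir`, `csv_path`, and `reloaded` are made up. It also passes `index=False`, which leaves the row numbers out of the file so that they do not come back as an extra unnamed column when the CSV is read again:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["dog", "mammal", 4],
        ["ant", "bug", 6],
        ["spider", "bug", 8],
        ["lizard", "reptile", 4],
        ["snake", "reptile", 0],
        ["pig", "mammal", 4],
    ],
    columns=["species", "class", "number_legs"],
)

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as project_root:
    # Create the output folder if it does not exist yet.
    output_dir = Path(project_root) / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save without the index column, then read the file back in.
    csv_path = output_dir / "1.1_data-output-example.csv"
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())  # ['species', 'class', 'number_legs']
```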

### Other file formats

On other occasions, you may need to read data from, or write data to, other file formats such as an Excel spreadsheet or a JSON file. You can always look up what functions are provided in `pandas` using the [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/io.html). For example, [it would tell you](https://pandas.pydata.org/pandas-docs/stable/reference/io.html#excel) that you can read an Excel file to a DataFrame using `read_excel()`.
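
As a small illustration of another format, here is a sketch of a JSON round trip using `to_json()` and `read_json()`. To keep it self-contained, it works with a string in memory instead of a file, and it uses a made-up three-row DataFrame called `animals`; the `orient="records"` layout (one JSON object per row) is just one of several options:

```python
from io import StringIO

import pandas as pd

# A small illustrative DataFrame (a subset of the earlier animal table).
animals = pd.DataFrame(
    [
        ["cat", "mammal", 4],
        ["ant", "bug", 6],
        ["snake", "reptile", 0],
    ],
    columns=["species", "class", "number_legs"],
)

# Convert the DataFrame to a JSON string with one object per row.
json_text = animals.to_json(orient="records")

# Read the JSON string back into a DataFrame.
reloaded = pd.read_json(StringIO(json_text), orient="records")

print(reloaded["species"].tolist())      # ['cat', 'ant', 'snake']
print(reloaded["number_legs"].tolist())  # [4, 6, 0]
```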

And of course, you can always try using a search engine to help you if you have any problems too!