# Pandas

## 1\) Importing dataset
A dataframe is a two-dimensional data structure with rows and columns. It is similar to a dictionary in which each key (a column) maps to a list of values (the rows). One column can be set as the index, so that rows can be selected by its values.
A dataframe can be created from a dictionary, a list of lists, a list of dictionaries, a list of tuples, a dictionary of series, a NumPy array, a series, or another dataframe.
To import a dataframe from a CSV file, we can use the function:

```python  
pd.read_csv(file_path, index_col="Column_to_set_as_index")
```

--> from a csv file

In [2]:
import pandas as pd
df_results=pd.read_csv(r"C:\Users\miche\Documenti\Csv files\Data_Stack_2019\survey_results_public.csv", index_col="Respondent")
df_questions=pd.read_csv(r"C:\Users\miche\Documenti\Csv files\Data_Stack_2019\survey_results_schema.csv", index_col="Column")
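
The survey files above live on the author's machine, so here is a minimal self-contained sketch of the same call. The CSV text and the name `df_demo` are made up for illustration; `read_csv` accepts a file-like object as well as a path, which is what makes this runnable without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Respondent,Hobbyist,Country
1,Yes,Italy
2,No,Germany
3,Yes,India
"""

# index_col picks the column whose values become the row labels.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

print(df_demo.index.name)         # Respondent
print(list(df_demo.columns))      # ['Hobbyist', 'Country']
print(df_demo.loc[2, "Country"])  # Germany
```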


--> from a dictionary

In [3]:
dic={"first":["Michele", "Gabriele", "Raffaele"], 
     "last": ["Capitollo", "Turco", "Turco"],
     "age": [20, 16, 10],
     "pokemon": ["piplup", "turtwig", "chimchar"],
     "email": ["michele@mail.it","gabriele@mail.it","raffaele@mail.it"]}
df_people=pd.DataFrame(dic)

A dataframe has several attributes and methods:

#1\) `df.shape`        -> Number of rows and columns, as a tuple (attribute)  
#2\) `df.info()`       -> Number of rows and columns, plus the name and datatype of each column  
#3\) `df.head(n)`      -> Shows the first n rows of the dataframe (5 by default)  
#4\) `df.tail(n)`      -> Shows the last n rows of the dataframe (5 by default)  
#5\) `df.columns`      -> Shows the name of each column (attribute)  
#6\) `df.index`        -> Shows the row labels (attribute)  
#7\) `df.sort_index()` -> Returns the dataframe with its rows sorted by index label
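
As a quick sketch of the attributes and methods listed above, assuming a small made-up dataframe (`df_demo` and its row labels are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"first": ["Michele", "Gabriele", "Raffaele"], "age": [20, 16, 10]},
    index=["c", "a", "b"],  # hypothetical row labels, deliberately unsorted
)

print(df_demo.shape)                   # (3, 2)
print(list(df_demo.columns))           # ['first', 'age']
print(list(df_demo.index))             # ['c', 'a', 'b']
print(len(df_demo.head(2)))            # 2
print(list(df_demo.tail(1)["first"]))  # ['Raffaele']

# info() prints a summary (row count, column names, dtypes) and returns None.
df_demo.info()

# sort_index() returns a sorted copy; df_demo itself keeps its original order.
sorted_demo = df_demo.sort_index()
print(list(sorted_demo.index))  # ['a', 'b', 'c']
```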

## 2\) Accessing columns
We can access a single column the same way we access the value associated with a key in a dictionary. The return type, however, is a series, a one-dimensional labelled array:  
#1\) `df["column"]`  -> Returns a series with the selected column  
#2\) `df[["column1", "column2"]]` -> Returns a dataframe with the selected columns (note the double brackets: a list of column names inside the indexing brackets)

```python
series_first_names = df_people["first"]  # a single label returns a Series
df_first_names = df_people[["first"]]  # a list of labels returns a DataFrame
print(type(series_first_names))
print(type(df_first_names))
```

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


## 3\) Series
Series have their own methods and attributes:  
#1\) `series.value_counts()`  -> Returns each distinct value in the series together with the number of times it appears  
#2\) `series.min()`           -> Returns the smallest element of the series  
#3\) `series.max()`           -> Returns the largest element of the series  
#4\) `series.sum()`           -> Returns the sum of the elements in the series

In [5]:
series_age=df_people["age"]
min_age=series_age.min()
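
The other series methods follow the same pattern. A minimal sketch, using standalone series built from the same values as `df_people` (the names `last_names` and `ages` are made up here):

```python
import pandas as pd

last_names = pd.Series(["Capitollo", "Turco", "Turco"])
ages = pd.Series([20, 16, 10])

# value_counts returns a series indexed by the distinct values.
counts = last_names.value_counts()
print(counts["Turco"])      # 2
print(counts["Capitollo"])  # 1

print(ages.min())  # 10
print(ages.max())  # 20
print(ages.sum())  # 46
```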


## 4\) Accessing rows
To access rows as well, we need to use the following indexers:  
#1\) `df.iloc[[list of rows], [list of columns]]`  -> Selects rows and columns by integer position, counting from 0. Slices exclude the end position.  
#2\) `df.loc[[list of rows], [list of columns]]`  -> Selects rows and columns by label. Slices include both end labels.

In [6]:
int_first_people=df_people.iloc[0] # First row, returned as a Series
reduced_data=df_results.loc[1:2,"Hobbyist":"Employment"] # With loc, slices include both endpoints
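
Since `df_results` depends on a local file, here is a small self-contained sketch of the difference between the two indexers. The dataframe `df_demo` and its row labels are assumptions made for the example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"first": ["Michele", "Gabriele", "Raffaele"],
     "last": ["Capitollo", "Turco", "Turco"],
     "age": [20, 16, 10]},
    index=["m", "g", "r"],  # hypothetical row labels
)

# iloc: integer positions, the end of a slice is excluded (like Python lists).
by_position = df_demo.iloc[0:2, 0:2]
print(by_position.shape)  # (2, 2)

# loc: labels, both ends of a slice are included.
by_label = df_demo.loc["m":"g", "first":"age"]
print(by_label.shape)  # (2, 3)

# A single row comes back as a Series; a list of rows comes back as a DataFrame.
print(type(df_demo.iloc[0]).__name__)    # Series
print(type(df_demo.iloc[[0]]).__name__)  # DataFrame
print(df_demo.loc["r", "age"])           # 10
```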


As we have seen, rows can be identified by labels. To choose which column provides the labels, we use the following methods:  
#1\) `df.set_index("column_to_set_as_index", inplace=True/False)` -> Uses the column as the index (it returns a new dataframe and does not change the df, unless we specify `inplace=True`)   
#2\) `df.reset_index()` -> Restores the default integer index and moves the old index back into a regular column

In [7]:
df_people.set_index("email", inplace=True)
df_people

| email | first | last | age | pokemon |
| --- | --- | --- | --- | --- |
| michele@mail.it | Michele | Capitollo | 20 | piplup |
| gabriele@mail.it | Gabriele | Turco | 16 | turtwig |
| raffaele@mail.it | Raffaele | Turco | 10 | chimchar |
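
To see what happens without `inplace=True`, and what `reset_index` undoes, here is a minimal sketch on a made-up two-row dataframe (`df_demo`, `indexed` and `restored` are names chosen for the example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Michele", "Gabriele"],
    "email": ["michele@mail.it", "gabriele@mail.it"],
})

# Without inplace=True, set_index returns a new dataframe and leaves df_demo alone.
indexed = df_demo.set_index("email")
print(list(df_demo.columns))  # ['first', 'email']
print(list(indexed.columns))  # ['first']
print(indexed.loc["gabriele@mail.it", "first"])  # Gabriele

# reset_index moves the labels back into a regular column and restores 0, 1, ...
restored = indexed.reset_index()
print(list(restored.columns))  # ['email', 'first']
print(list(restored.index))    # [0, 1]
```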


## 5\) Filtering
When we compare a column (a series) to a specific value, we get another series of boolean values. We can then pass that series to the dataframe to get the matching rows.  
To get the opposite boolean series, we put a tilde before the filter to negate it. The comparison must be wrapped in parentheses, because `~` binds more tightly than `==`.  
#0\) `filt = df["column"] == "value"` -> Returns a boolean series that is True where the condition holds  
#0\) `filt_opposite = ~(df["column"] == "value")` -> Returns the opposite boolean series  
We can filter the dataframe by using the following methods:  
#1\) `df[df["column"] == "value"]` -> Returns the rows that satisfy the condition  
#2\) `df[(df["column1"] == "value1") & (df["column2"] == "value2")]` -> Returns the rows that satisfy both conditions  
#3\) `df[(df["column1"] == "value1") | (df["column2"] == "value2")]` -> Returns the rows that satisfy at least one of the two conditions  
#4\) `df[df["column"].isin(["value1", "value2"])]` -> Returns the rows whose value in the column is one of the values in the list  
#5\) `df[df["column"].isnull()]` -> Returns the rows whose value in the column is null  
#6\) `df[df["column"].notnull()]` -> Returns the rows whose value in the column is not null  
#7\) `df[df["column"].str.contains("value")]` -> Returns the rows whose value in the column contains the given string (interpreted as a regular expression by default)  
#8\) `df[df["column"].str.startswith("value")]` -> Returns the rows whose value in the column starts with the given string  
#9\) `df[df["column"].str.endswith("value")]` -> Returns the rows whose value in the column ends with the given string
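
The null and string filters in the list above can be sketched as follows. The dataframe `df_demo` is made up, with one missing email on purpose; passing `na=False` and `regex=False` is a choice made for this example, not something the list requires.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Michele", "Gabriele", "Raffaele"],
    "email": ["michele@mail.it", None, "raffaele@mail.com"],
})

missing_email = df_demo[df_demo["email"].isnull()]
print(list(missing_email["first"]))  # ['Gabriele']

has_email = df_demo[df_demo["email"].notnull()]
print(list(has_email["first"]))  # ['Michele', 'Raffaele']

# na=False treats missing values as "no match", so the mask stays purely boolean.
italian = df_demo[df_demo["email"].str.endswith(".it", na=False)]
print(list(italian["first"]))  # ['Michele']

# regex=False searches for the literal substring instead of a regular expression.
contains_ri = df_demo[df_demo["first"].str.contains("ri", regex=False)]
print(list(contains_ri["first"]))  # ['Gabriele']

# Parentheses are required: ~ binds more tightly than ==.
not_michele = df_demo[~(df_demo["first"] == "Michele")]
print(list(not_michele["first"]))  # ['Gabriele', 'Raffaele']
```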


In [8]:
filt1= (df_people["first"]=="Michele") &  (df_people["last"]=="Capitollo")
filt2= (df_people["last"]=="Turco")
michele=df_people[filt1]
not_michele=df_people[~filt1] # Every row except Michele's (Gabriele and Raffaele)
result=df_people[filt2]


high_salary=(df_results["ConvertedComp"]>70000)
filtered=(df_results[high_salary])[["ConvertedComp","Employment","EdLevel","LanguageWorkedWith"]]
list_of_countries=["United States","India", "United Kingdom", "Germany"]
# To check a column against multiple values, first create a list and then use the isin method.
filt_country=df_results["Country"].isin(list_of_countries)
# The filter can also be used as the row selector in loc
country=df_results.loc[filt_country,["Country","Employment"]]
print(type(result[["first"]]))

<class 'pandas.core.frame.DataFrame'>


## 6\) Updating rows and columns  
We can update the values of the entire dataframe by using the following methods:  
#1\) `df = df.applymap(function)` -> Replaces every value of the dataframe with the value returned by the function (since pandas 2.1, `applymap` is deprecated in favour of `df.map(function)`)  
#2\) `df = df.applymap(lambda x: expression)` -> Same as above, with the function written inline as a lambda  
#3\) `df = df.replace(dictionary)` -> Replaces every value that appears as a key of the dictionary with the corresponding dictionary value  
#4\) `df = df.replace([list of old values], [list of new values])` -> Replaces each value of the first list with the value at the same position in the second list  

We can update the names of the columns by using the following methods:  
#1\) `df.columns = ["column1", "column2", "column3"]` -> Replaces all the column names with the list passed (one name per column)    
#2\) `df.rename(columns={"old_name": "new_name"}, inplace=True)` -> Renames only the columns listed in the dictionary (it does not change the df, unless we specify `inplace=True`)  
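
A short sketch of the two renaming approaches, on a made-up one-row dataframe (`df_demo` and the new column names are chosen only for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"first": ["Michele"], "last": ["Capitollo"]})

# rename only touches the columns named in the dictionary and returns a new dataframe.
renamed = df_demo.rename(columns={"first": "first_name"})
print(list(renamed.columns))  # ['first_name', 'last']
print(list(df_demo.columns))  # ['first', 'last']

# Assigning to df.columns replaces every name at once, so the list length must match.
df_demo.columns = ["FIRST", "LAST"]
print(list(df_demo.columns))  # ['FIRST', 'LAST']
```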

We can update the values of a column by using the following methods:  
#1\) `df["column"] = "value"` -> Sets every value of the column to the given value  
#2\) `df["column"] = df["column"].apply(function)` -> Replaces every value of the column with the value returned by the function  
#3\) `df["column"] = df["column"].apply(lambda x: expression)` -> Same as above, with the function written inline as a lambda  
#4\) `df["column"] = df["column"].map(dictionary)` -> Replaces every value of the column with the corresponding dictionary value. Values that are not keys of the dictionary become NaN.   
#5\) `df["column"] = df["column"].replace(dictionary)` -> Replaces only the values that are keys of the dictionary. All other values are left unchanged.  
#6\) `df["column"] = df["column"].replace([list of old values], [list of new values])` -> Replaces each value of the first list with the value at the same position in the second list  
#7\) `df["column"] = df["column"].str.string_method()` -> Replaces every value of the column with the result of the string method (for example `str.lower()`)
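
The difference between `map` and `replace` is easiest to see on a dictionary that does not cover every value. A minimal sketch, where the `evolutions` mapping is a made-up example and deliberately leaves out one entry:

```python
import pandas as pd

pokemon = pd.Series(["piplup", "turtwig", "chimchar"])
# Hypothetical mapping; "chimchar" is missing on purpose.
evolutions = {"piplup": "prinplup", "turtwig": "grotle"}

mapped = pokemon.map(evolutions)        # unmatched values become NaN
replaced = pokemon.replace(evolutions)  # unmatched values are kept

print(mapped.isnull().tolist())  # [False, False, True]
print(replaced.tolist())         # ['prinplup', 'grotle', 'chimchar']

# String methods are applied element by element through the .str accessor.
print(pokemon.str.upper().tolist())  # ['PIPLUP', 'TURTWIG', 'CHIMCHAR']
```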

We can update only some values of a column by combining a filter with `loc`:  
#1\) `df.loc[df["column"] == "value", "column"] = "new_value"` -> Sets the column to the new value, only in the rows that satisfy the condition  
#2\) `df.loc[df["column"] == "value", "column"] = df["column"].apply(function)` -> Sets the column to the value returned by the function, only in the rows that satisfy the condition (the right-hand side is aligned on the index, so the other rows are left untouched)  
#3\) `df.loc[df["column"] == "value", "column"] = df["column"].apply(lambda x: expression)` -> Same as above, with the function written inline as a lambda  
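
A minimal sketch of a conditional update with `loc`, assuming a made-up copy of the people data (`df_demo`, and the replacement surname "Turchi", are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Michele", "Gabriele", "Raffaele"],
    "last": ["Capitollo", "Turco", "Turco"],
    "age": [20, 16, 10],
})

# Build the filter once, before any value changes.
is_turco = df_demo["last"] == "Turco"

# Overwrite one column for the matching rows only.
df_demo.loc[is_turco, "last"] = "Turchi"
print(df_demo["last"].tolist())  # ['Capitollo', 'Turchi', 'Turchi']

# Compute the new values from the matching rows and write them back.
df_demo.loc[is_turco, "age"] = df_demo.loc[is_turco, "age"] + 1
print(df_demo["age"].tolist())  # [20, 17, 11]
```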



In [9]:
df_people_type=df_people.applymap(type) # On pandas >= 2.1, df_people.map(type) avoids the deprecation warning
print("Dataframe showing the datatype of each entry:")
print(df_people_type)
df_people["age"]=df_people["age"].apply(lambda x: 10*x) # Multiply every age by 10
series_age10=df_people["age"]
print(series_age10)

Dataframe showing the datatype of each entry:
                          first           last            age        pokemon
email                                                                       
michele@mail.it   <class 'str'>  <class 'str'>  <class 'int'>  <class 'str'>
gabriele@mail.it  <class 'str'>  <class 'str'>  <class 'int'>  <class 'str'>
raffaele@mail.it  <class 'str'>  <class 'str'>  <class 'int'>  <class 'str'>
email
michele@mail.it     200
gabriele@mail.it    160
raffaele@mail.it    100
Name: age, dtype: int64


## 7\) Adding and removing rows and columns
To delete a column, we can use the following methods:  
#1\) `df.drop("column", axis=1, inplace=True)` -> Deletes the given column (it does not change the df, unless we specify `inplace=True`)  
#2\) `df.drop(["column1", "column2"], axis=1, inplace=True)` -> Deletes the given columns (it does not change the df, unless we specify `inplace=True`)  
To delete a row, we use the same methods, but we need to specify axis=0 and pass row labels.  
#1\) `df.drop("row", axis=0, inplace=True)` -> Deletes the row with the given label (it does not change the df, unless we specify `inplace=True`)  
#2\) `df.drop(["row1", "row2"], axis=0, inplace=True)` -> Deletes the rows with the given labels (it does not change the df, unless we specify `inplace=True`)  
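
A small sketch of `drop` on both axes, without `inplace=True` so that the original stays available for comparison (`df_demo` and its row labels are made up for the example):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"first": ["Michele", "Gabriele", "Raffaele"],
     "age": [20, 16, 10],
     "pokemon": ["piplup", "turtwig", "chimchar"]},
    index=["m", "g", "r"],  # hypothetical row labels
)

# axis=1 drops columns by name.
no_pokemon = df_demo.drop("pokemon", axis=1)
print(list(no_pokemon.columns))  # ['first', 'age']

# axis=0 drops rows by label.
only_r = df_demo.drop(["m", "g"], axis=0)
print(list(only_r.index))  # ['r']

# Without inplace=True the original dataframe is untouched.
print(df_demo.shape)  # (3, 3)
```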

To add a column, we can use the following methods:  
#1\) `df["new_column"] = "value"` -> Adds a new column named "new_column" with every row set to "value"  
#2\) `df["new_column"] = df["column"].apply(function)` -> Adds a new column named "new_column" holding the value returned by the function for each row of "column"  
#3\) `df["new_column"] = df["column"].apply(lambda x: expression)` -> Same as above, with the function written inline as a lambda  
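
A minimal sketch of adding columns, on a made-up two-row dataframe. The column names `country`, `full_name` and `first_len` are hypothetical, and combining two columns with `+` is an extra shown here for convenience rather than one of the methods listed above.

```python
import pandas as pd

df_demo = pd.DataFrame({"first": ["Michele", "Gabriele"], "last": ["Capitollo", "Turco"]})

# A scalar is broadcast to every row.
df_demo["country"] = "Italy"

# Existing columns can be combined element by element.
df_demo["full_name"] = df_demo["first"] + " " + df_demo["last"]

# apply runs the function once per value of the column.
df_demo["first_len"] = df_demo["first"].apply(len)

print(df_demo["country"].tolist())    # ['Italy', 'Italy']
print(df_demo["full_name"].tolist())  # ['Michele Capitollo', 'Gabriele Turco']
print(df_demo["first_len"].tolist())  # [7, 8]
```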

It is also possible to add a row by creating a new row and appending it to the dataframe:  
#1\) `new_row = {"column1": "value1", "column2": "value2", "column3": "value3"}` -> Creates a new row  
#2\) `df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)` -> Returns a new dataframe with the row appended. There is no `inplace` option, so the result must be assigned back. (`df.append(new_row, ignore_index=True)` did the same in older versions, but it was removed in pandas 2.0.)  

To add a row, we can also use `loc` with a label that does not exist yet:  
#1\) `df.loc["new_row"] = "value"` -> Adds a new row labelled "new_row" with every column set to "value"  
#2\) `df.loc["new_row"] = [value1, value2, value3]` -> Adds a new row labelled "new_row" with one value per column  
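
Both ways of adding a row can be sketched on a made-up one-row dataframe. `pd.concat` is used here in place of the removed `DataFrame.append`; the names `df_demo` and `new_row` are just for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({"first": ["Michele"], "age": [20]})

# Wrap the dictionary in a one-row dataframe, then concatenate.
new_row = {"first": "Gabriele", "age": 16}
df_demo = pd.concat([df_demo, pd.DataFrame([new_row])], ignore_index=True)
print(df_demo.shape)        # (2, 2)
print(list(df_demo.index))  # [0, 1]

# loc with a label that does not exist yet adds a row; the list needs one value per column.
df_demo.loc[2] = ["Raffaele", 10]
print(df_demo["first"].tolist())  # ['Michele', 'Gabriele', 'Raffaele']
print(df_demo["age"].tolist())    # [20, 16, 10]
```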



