# Pandas:
- Pandas is a powerful Python library for data analysis and manipulation. It is widely used in data analytics, data science and machine learning. Pandas can read data from CSV (.csv) and Excel (.xlsx) files as well as from SQL databases.

# Basic pandas functions:
- Pandas has a lot of inbuilt functions for data analysis and manipulation. Some of them are listed below:

    - `read_csv()` &#8594; Reads the data from csv files.
    - `read_excel()` &#8594; Reads the data from excel files.
    - `apply()` &#8594; Applies a function along an axis of the DataFrame (to each row or column), or to each value of a single column.
    - `fillna()` &#8594; Fills the missing values of the DataFrame with a specified value (normally the mean or median is used).
    - `describe()` &#8594; Generates summary statistics such as the count, mean, standard deviation, minimum, quartiles and maximum for the numerical columns of the DataFrame.
    - `value_counts()` &#8594; Returns the frequency of each unique value in a specified column of the DataFrame.
    - `to_csv()` &#8594; Saves the DataFrame as a CSV file.
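
As a minimal sketch of how `to_csv()` and `read_csv()` fit together, the example below writes a small DataFrame to CSV text and reads it back. The `sample` data and the in-memory `buffer` (used here instead of a file on disk) are assumptions made for illustration and are not part of the salary dataset.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
sample = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "score": [81.5, None, 92.0]}
)

# Write the DataFrame as CSV text into an in-memory buffer instead of a file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Rewind the buffer and read the same CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file path such as `"output.csv"` instead of the buffer should work the same way. The empty score is written as an empty field and comes back as a missing value.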


*[Dataset URL](https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer)*

# Let's get coding!

In [1]:
import pandas as pd

df = pd.read_csv(r"Salary Data.csv")

`.head()` &#8594; The `head()` function displays the top 5 rows of your df by default. To change the number of rows displayed, pass the number of rows you want.

`.tail()` &#8594; The `tail()` function works exactly like the `head()` function, except that it returns rows from the bottom.

In [2]:
df.head(4)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0


In [3]:
df.tail(4)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
371,43.0,Male,Master's,Director of Operations,19.0,170000.0
372,29.0,Female,Bachelor's,Junior Project Manager,2.0,40000.0
373,34.0,Male,Bachelor's,Senior Operations Coordinator,7.0,90000.0
374,44.0,Female,PhD,Senior Business Analyst,15.0,150000.0


`.describe()` &#8594; The `describe()` function summarizes the DataFrame. It returns a smaller DataFrame containing the count, mean, standard deviation, minimum, quartiles and maximum of the numerical columns.

In [4]:
df.describe()

Unnamed: 0,Age,Years of Experience,Salary
count,373.0,373.0,373.0
mean,37.431635,10.030831,100577.345845
std,7.069073,6.557007,48240.013482
min,23.0,0.0,350.0
25%,31.0,4.0,55000.0
50%,36.0,9.0,95000.0
75%,44.0,15.0,140000.0
max,53.0,25.0,250000.0
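
Text columns such as `Gender` do not appear in the output above because `describe()` summarizes only numerical columns by default. The sketch below, which uses a small hypothetical `sample` DataFrame rather than the salary dataset, shows that calling `describe()` on a text column returns a different kind of summary.

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
sample = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Male"],
        "Salary": [90000.0, 65000.0, 150000.0, 60000.0],
    }
)

# On the whole DataFrame, only the numeric column is summarized.
numeric_summary = sample.describe()

# On a text column: count, number of unique values, the most common
# value ("top") and how often it occurs ("freq").
gender_summary = sample["Gender"].describe()

print(numeric_summary)
print(gender_summary)
```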


`.info()` &#8594; Prints basic information about the DataFrame, such as the data type and the non-null count of each column. This is useful for figuring out which functions can be used on which column.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375 entries, 0 to 374
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  373 non-null    float64
 1   Gender               373 non-null    object 
 2   Education Level      373 non-null    object 
 3   Job Title            373 non-null    object 
 4   Years of Experience  373 non-null    float64
 5   Salary               373 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.7+ KB
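
If you need the column types as data rather than as printed text, a possible approach is the `dtypes` attribute together with `select_dtypes()`. This is a small sketch on a hypothetical `sample` DataFrame, not on the salary dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
sample = pd.DataFrame(
    {
        "Age": [32.0, 28.0],
        "Gender": ["Male", "Female"],
        "Salary": [90000.0, 65000.0],
    }
)

# dtypes returns the data type of each column as a Series.
print(sample.dtypes)

# select_dtypes keeps only the columns whose type matches the filter.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```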


Sometimes the data provided to us is not perfect: it can contain missing values, which affect data analysis. To handle such cases, we can either drop the rows that contain missing values or fill them with the mean or median of the column, depending on the use case. To count the missing values in each column, we can call the `sum()` function on the output of the `isna()` function.

In [6]:
df.isna().sum()

Age                    2
Gender                 2
Education Level        2
Job Title              2
Years of Experience    2
Salary                 2
dtype: int64

In the output, we can see that each column has two missing values. Such a low number of missing values won't affect the analysis much, but it is still considered good practice to handle them. We will drop the rows that contain them for now.

In [7]:
df.dropna(inplace=True)  # inplace=True modifies df itself; with inplace=False (the default), dropna() returns a new DataFrame and leaves df unchanged.
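
Filling is the alternative to dropping mentioned earlier. The sketch below uses a hypothetical `sample` DataFrame, not the salary dataset. Filling the text column with its most frequent value is an assumption added here for illustration, since a mean or median does not exist for text.

```python
import pandas as pd

# Hypothetical sample data with one missing gender and one missing salary.
sample = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female", "Female"],
        "Salary": [90000.0, 65000.0, None, 150000.0],
    }
)

# Value for the numeric column: its median.
median_salary = sample["Salary"].median()

# Value for the text column: its most frequent entry (an assumption).
most_common_gender = sample["Gender"].mode()[0]

# A dict maps each column name to the value used to fill that column.
# fillna returns a new DataFrame and leaves sample unchanged.
filled = sample.fillna({"Salary": median_salary, "Gender": most_common_gender})

print(filled)
```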

# Accessing a column in DataFrame:
- To access a single column of our DF, we can use a syntax similar to dictionaries (`df['Salary']` or `df.get("Age")`). In addition, we can access it directly with the access operator (.), as long as the column name is a valid Python identifier.

In [8]:
age_column = df.get("Age")
salary_column = df['Salary']
gender_column = df.Gender

In [9]:
print(age_column)
print("-------")
print(salary_column)
print("-------")
print(gender_column)

0      32.0
1      28.0
2      45.0
3      36.0
4      52.0
       ... 
370    35.0
371    43.0
372    29.0
373    34.0
374    44.0
Name: Age, Length: 373, dtype: float64
-------
0       90000.0
1       65000.0
2      150000.0
3       60000.0
4      200000.0
         ...   
370     85000.0
371    170000.0
372     40000.0
373     90000.0
374    150000.0
Name: Salary, Length: 373, dtype: float64
-------
0        Male
1      Female
2        Male
3      Female
4        Male
        ...  
370    Female
371      Male
372    Female
373      Male
374    Female
Name: Gender, Length: 373, dtype: object
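
The three access styles are not fully interchangeable. The following sketch, on a hypothetical `sample` DataFrame, shows where they differ; selecting several columns with a list of names is an extra not covered in the cells above.

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
sample = pd.DataFrame(
    {"Age": [32.0, 28.0], "Education Level": ["Bachelor's", "Master's"]}
)

# Bracket access works for any column name, including names with spaces.
education = sample["Education Level"]

# Dot access only works when the name is a valid Python identifier,
# so sample.Age works but "Education Level" cannot be reached this way.
age = sample.Age

# get() returns None (or a default you pass) instead of raising a KeyError
# when the column does not exist.
missing = sample.get("Salary")

# A list of names selects several columns and returns a DataFrame.
subset = sample[["Age", "Education Level"]]

print(missing)
print(subset)
```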


# Getting value counts for a column:
- In pandas we can use the `value_counts()` function to get the number of times each distinct value occurs in a column of our DF. This is useful for getting more insight into the column and the data it holds.
- The function returns a Series (a one-dimensional data structure, similar to a single DataFrame column) containing the count of each unique value, sorted from most to least frequent, in the format given below:

<column_name>

<value 1> <count of value 1 (the frequency of value 1 in the column)>


In [10]:
values_age = age_column.value_counts()

In [11]:
print(values_age)

Age
33.0    24
29.0    23
35.0    22
44.0    21
31.0    21
36.0    20
45.0    17
34.0    17
47.0    15
30.0    15
38.0    15
40.0    13
28.0    13
32.0    12
39.0    12
43.0    12
41.0    12
37.0    12
42.0    11
46.0    10
27.0     9
48.0     9
50.0     8
49.0     8
26.0     7
51.0     5
25.0     4
52.0     3
24.0     1
23.0     1
53.0     1
Name: count, dtype: int64
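
When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch on a hypothetical `genders` Series, not on the salary dataset:

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
genders = pd.Series(["Male", "Female", "Male", "Male"], name="Gender")

# Raw counts, sorted from most to least frequent.
counts = genders.value_counts()

# Relative frequencies: each count divided by the number of non-missing values.
shares = genders.value_counts(normalize=True)

print(counts)
print(shares)
```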


`apply()` &#8594; As the name suggests, it applies either an inbuilt or a user-defined function to each value of a column. It takes the function to apply as a parameter.

In [12]:
def preprocess_education(education_level: str) -> str:
    # Remove surrounding whitespace, then convert to lowercase.
    education_level = education_level.strip()
    education_level = education_level.lower()

    return education_level

In [13]:
df["Education Level"] = df['Education Level'].apply(preprocess_education)

In [14]:
print(df['Education Level'][0])

bachelor's
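
For simple text clean-up like this, pandas also offers string methods through the `.str` accessor, which can replace the custom function. The sketch below compares both approaches on a hypothetical `levels` Series. Note that `apply()` with this function would raise an error on a missing value, whereas the `.str` methods leave missing values as they are.

```python
import pandas as pd

# Hypothetical sample data with stray whitespace and mixed case.
levels = pd.Series([" Bachelor's", "Master's ", "PhD"], name="Education Level")


def preprocess_education(education_level: str) -> str:
    # Same clean-up as in the cell above, written as one expression.
    return education_level.strip().lower()


# Element-wise call of the Python function.
with_apply = levels.apply(preprocess_education)

# The same clean-up using the .str accessor.
with_str = levels.str.strip().str.lower()

print(with_str.tolist())
```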


`.loc` &#8594; The `loc` property of a DataFrame gets or sets the values at the specified labels (row, column). `loc` also supports slicing; unlike regular Python slicing, the result includes both the start and the stop labels.

----
DataFrames follow the structure given below, and the `loc` property uses this structure to get/set the data in specific cells. If you have used Excel, think of it as accessing the value in A1 and setting it to something else. In the notation 'A1', "A" refers to the column and "1" refers to the row; in pandas the row comes first, in the `df.loc[row, column]` format.

| Column 1      | Column 2      |
| ------------- | ------------- |
| Row 1 | Data 1 |
| Row 2 | Data 2 |

In [15]:
#Getting the data:
print(df.loc[1,"Age"]) #Prints the value in row 1 of the "Age" column

28.0


In [16]:
#Setting the data:
df.loc[1,"Age"] = 29

In [17]:
print(df.loc[1,"Age"])

29.0


Other ways to use loc:
- Passing just the row label returns the data in that specific row.
- Passing a list of row labels selects multiple rows and works like passing just one row.
- Slicing rows works like list slicing, but includes the data of both the start and stop labels (rows).
- Passing both rows and columns returns only the selected cells. You can think of it as selecting the range from A10 to C15 in Excel.


In [18]:
print(df.loc[0])

Age                                 32.0
Gender                              Male
Education Level               bachelor's
Job Title              Software Engineer
Years of Experience                  5.0
Salary                           90000.0
Name: 0, dtype: object


In [19]:
print(df.loc[[0,1]])

    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      bachelor's  Software Engineer                  5.0   
1  29.0  Female        master's       Data Analyst                  3.0   

    Salary  
0  90000.0  
1  65000.0  


In [20]:
print(df.loc[0:4])

    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      bachelor's  Software Engineer                  5.0   
1  29.0  Female        master's       Data Analyst                  3.0   
2  45.0    Male             phd     Senior Manager                 15.0   
3  36.0  Female      bachelor's    Sales Associate                  7.0   
4  52.0    Male        master's           Director                 20.0   

     Salary  
0   90000.0  
1   65000.0  
2  150000.0  
3   60000.0  
4  200000.0  


In [22]:
print(df.loc[[1,3],['Age','Gender']])

    Age  Gender
1  29.0  Female
3  36.0  Female
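
Because `loc` works with labels rather than positions, its behavior can be surprising after rows have been dropped. The sketch below uses a hypothetical `sample` DataFrame to illustrate this. It also uses `iloc` (position-based selection) and a boolean condition as the row selector, neither of which appears in the cells above.

```python
import pandas as pd

# Hypothetical sample data, not taken from the salary dataset.
sample = pd.DataFrame(
    {
        "Age": [32.0, 28.0, None, 36.0],
        "Salary": [90000.0, 65000.0, 150000.0, 60000.0],
    }
)

# Dropping the row with the missing age leaves the labels 0, 1 and 3.
cleaned = sample.dropna()

# loc selects by label and includes both ends of the slice.
by_label = cleaned.loc[0:3]

# iloc selects by position and excludes the stop position.
by_position = cleaned.iloc[0:2]

# A boolean condition can also be used as the row selector.
over_30 = cleaned.loc[cleaned["Age"] > 30, ["Age", "Salary"]]

print(by_label.index.tolist())
print(by_position.index.tolist())
print(over_30)
```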


# To learn more about Pandas, you can refer to the following resources:
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Python for Data Analysis by Wes McKinney](https://wesmckinney.com/book/)
- [freeCodeCamp](https://www.youtube.com/watch?v=gtjxAH8uaP0)
