Pandas is an open-source Python library for data manipulation and analysis. It consists of data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets or SQL tables. It is widely used in data science because it works well with other important libraries. It is built on top of the NumPy library and makes data easier to manipulate and analyze.

> Common tasks you can perform with Pandas:

1. **Data Cleaning, Merging and Joining:** Clean and combine data from multiple sources, handling inconsistencies and duplicates.
2. **Handling Missing Data:** Manage missing values (NaN) in both floating-point and non-floating-point data.
3. **Column Insertion and Deletion:** Easily add, remove or modify columns in a DataFrame.
4. **Group By Operations:** Use "split-apply-combine" to group and analyze data.
5. **Data Visualization:** Create visualizations with Matplotlib and Seaborn, integrated with Pandas.
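
The first four tasks above can be sketched on a tiny example. This is only an illustration: the `employees` and `departments` tables, their column names, and all values are hypothetical and do not come from the datasets used later. Visualization is left out to keep the sketch free of plotting dependencies.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration only.
employees = pd.DataFrame({
    "emp_id": [1, 2, 2, 3, 4],
    "name": ["Alice", "Bob", "Bob", "Charlie", "Dana"],
    "dept_id": [10, 20, 20, 10, 20],
    "salary": [50000.0, 60000.0, 60000.0, np.nan, 80000.0],
})
departments = pd.DataFrame({
    "dept_id": [10, 20],
    "dept_name": ["Engineering", "Sales"],
})

# 1. Cleaning and merging: drop the duplicated row, then join on dept_id.
clean = employees.drop_duplicates().reset_index(drop=True)
merged = clean.merge(departments, on="dept_id", how="left")

# 2. Missing data: fill the missing salary with the column median.
merged["salary"] = merged["salary"].fillna(merged["salary"].median())

# 3. Column insertion and deletion.
merged["salary_k"] = merged["salary"] / 1000
merged = merged.drop(columns=["emp_id"])

# 4. Group by: split by department, apply the mean, combine into a Series.
mean_by_dept = merged.groupby("dept_name")["salary"].mean()
print(mean_by_dept)
```

Filling with the median is just one possible choice here; dropping the incomplete rows or using a different statistic may suit other data better.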

# Installing Pandas

In [15]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.



In [16]:
import pandas as pd

In [17]:
# Build a small DataFrame from a dictionary that maps column names to values
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "salary": [50000, 60000, 70000],
})

In [21]:
df.head()

,Name,Age,salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000


# Reading a Dataset

In [22]:
df = pd.read_csv('dataset/Salary_dataset.csv')

In [25]:
df.head(3)

,Unnamed: 0,YearsExperience,Salary
0,0,1.2,39344.0
1,1,1.4,46206.0
2,2,1.6,37732.0


In [28]:
df.tail(10)

,Unnamed: 0,YearsExperience,Salary
20,20,6.9,NaN
21,21,7.2,98274.0
22,22,8.0,101303.0
23,23,NaN,113813.0
24,24,8.8,109432.0
25,25,9.1,NaN
26,26,9.6,116970.0
27,27,NaN,112636.0
28,28,10.4,122392.0
29,29,10.6,121873.0


In [29]:
df.shape

(30, 3)

In [30]:
df.columns

Index(['Unnamed: 0', 'YearsExperience', 'Salary'], dtype='object')
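
An `Unnamed: 0` column like the one above typically appears when a CSV file was saved together with its row index and then read back with default settings. Whether that is what happened to this file cannot be confirmed from the output alone, so the following is only a sketch under that assumption. The CSV text is hypothetical and stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text mimicking a file saved with its index; values are made up.
csv_text = ",YearsExperience,Salary\n0,1.0,40000.0\n1,2.0,50000.0\n"

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Alternatively, drop the column after loading.
dropped = with_extra.drop(columns=["Unnamed: 0"])

print(list(with_extra.columns))
print(list(as_index.columns))
```

Either approach leaves only the data columns, which also keeps the index values out of summary statistics such as `describe()`.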

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       30 non-null     int64  
 1   YearsExperience  27 non-null     float64
 2   Salary           28 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 852.0 bytes


In [32]:
df.describe()

,Unnamed: 0,YearsExperience,Salary
count,30.0,27.0,28.0
mean,14.5,5.148148,74385.642857
std,8.803408,2.807596,27621.205052
min,0.0,1.2,37732.0
25%,7.25,3.2,56431.0
50%,14.5,4.2,63832.5
75%,21.75,7.05,99031.25
max,29.0,10.6,122392.0


In [None]:
df.notnull()

,Unnamed: 0,YearsExperience,Salary
0,True,True,True
1,True,True,True
2,True,True,True
3,True,True,True
4,True,True,True
5,True,True,True
6,True,True,True
7,True,True,True
8,True,True,True
9,True,True,True


In [34]:
df.notnull().sum()

Unnamed: 0         30
YearsExperience    27
Salary             28
dtype: int64

In [35]:
df.isnull()

,Unnamed: 0,YearsExperience,Salary
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,False,False


In [36]:
df.isnull().sum()

Unnamed: 0         0
YearsExperience    3
Salary             2
dtype: int64
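
Once the missing values have been counted, two common ways to handle them are dropping the incomplete rows or filling the gaps. The sketch below shows both on a hypothetical stand-in frame (`sample`, with made-up values) rather than on the salary dataset itself. Which option is appropriate depends on the data and the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data; values are made up.
sample = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, np.nan, 4.0],
    "Salary": [40000.0, np.nan, 60000.0, 70000.0],
})

# Option 1: drop every row that contains at least one missing value.
dropped = sample.dropna()

# Option 2: fill each column's gaps with that column's mean.
filled = sample.fillna(sample.mean())

print(dropped.shape)
print(filled.isnull().sum())
```

Note that `dropna()` and `fillna()` return new DataFrames here, so `sample` itself is left unchanged.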

In [37]:
df1 = pd.read_csv('dataset/Soil Nutrients.csv')

In [38]:
df1.head()

,Name,Fertility,Photoperiod,Temperature,Rainfall,pH,Light_Hours,Light_Intensity,Rh,Nitrogen,Phosphorus,Potassium,Yield,Category_pH,Soil_Type,Season,N_Ratio,P_Ratio,K_Ratio
0,Strawberry,Moderate,Day Neutral,20.887923,747.860765,6.571548,13.091483,533.762876,91.197196,170.800381,118.670058,243.331211,20.369555,low_acidic,Loam,Summer,10.0,10.0,10.0
1,Strawberry,Moderate,Day Neutral,18.062721,711.104329,6.251806,13.063016,505.789101,91.939623,179.290364,121.020244,246.910378,20.402751,low_acidic,Loam,Spring,10.0,10.0,10.0
2,Strawberry,Moderate,Short Day Period,16.776782,774.038247,6.346916,12.945927,512.985617,91.387286,181.440732,116.936806,242.699601,19.158847,low_acidic,Loam,Summer,10.0,10.0,10.0
3,Strawberry,Moderate,Short Day Period,14.281,665.633506,6.259598,13.318922,484.860067,91.254598,176.165282,122.233153,237.096892,20.265745,low_acidic,Loam,Summer,10.0,10.0,10.0
4,Strawberry,Moderate,Day Neutral,21.44449,806.531455,6.384368,13.312915,512.747307,92.354829,182.935334,126.088234,243.880364,20.397336,low_acidic,Loam,Spring,10.0,10.0,10.0


In [39]:
df1.shape

(15400, 19)

In [41]:
df1.isnull().sum()

Name               0
Fertility          0
Photoperiod        0
Temperature        0
Rainfall           0
pH                 0
Light_Hours        0
Light_Intensity    0
Rh                 0
Nitrogen           0
Phosphorus         0
Potassium          0
Yield              0
Category_pH        0
Soil_Type          0
Season             0
N_Ratio            0
P_Ratio            0
K_Ratio            0
dtype: int64

In [None]:
#df1.to_csv("output1.csv", index=False)

In [44]:
df1["Fertility"]

0        Moderate
1        Moderate
2        Moderate
3        Moderate
4        Moderate
           ...   
15395    Moderate
15396    Moderate
15397    Moderate
15398    Moderate
15399    Moderate
Name: Fertility, Length: 15400, dtype: object

In [50]:
df1['Rh']

0        91.197196
1        91.939623
2        91.387286
3        91.254598
4        92.354829
           ...    
15395    65.057374
15396    66.747340
15397    65.803531
15398    64.563183
15399    63.412816
Name: Rh, Length: 15400, dtype: float64

In [51]:
df1['Rh'].isnull().sum()

np.int64(0)
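
The column selections above each return a Series. As a small sketch of how selection generalizes, the example below uses a hypothetical three-row miniature of the soil table (`soil`, with made-up values). It also assumes that `Rh` stands for relative humidity, which the dataset itself does not state.

```python
import pandas as pd

# Hypothetical miniature of the soil table; values are made up.
soil = pd.DataFrame({
    "Name": ["Strawberry", "Strawberry", "Tomato"],
    "Fertility": ["Moderate", "Moderate", "High"],
    "Rh": [91.2, 65.1, 70.4],
})

one_column = soil["Rh"]             # single label -> Series
two_columns = soil[["Name", "Rh"]]  # list of labels -> DataFrame

# Boolean mask: keep rows whose Rh value (assumed relative humidity) exceeds 70.
humid = soil[soil["Rh"] > 70]

print(type(one_column).__name__, type(two_columns).__name__, len(humid))
```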