# What is Pandas?

Pandas is one of the most widely used Python libraries for data analysis.

It provides data structures built for that work, most notably Series and DataFrame. A Series is one-dimensional: a single column of labelled values. A DataFrame is two-dimensional: a table made of rows and columns.
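
The difference in dimensionality can be checked directly. The following is a minimal sketch; the names `ages` and `people` are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([18, 19, 20], name="age")

# A DataFrame is a table: several columns that share one row index.
people = pd.DataFrame({"name": ["Bill", "Tom", "Tim"], "age": [18, 19, 20]})

print(ages.ndim)     # number of dimensions of the Series
print(people.ndim)   # number of dimensions of the DataFrame
print(people.shape)  # (rows, columns)
```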

To install pandas, run `pip install pandas`.

In [None]:
import pandas as pd  # Import pandas under the conventional alias pd

In [None]:
pd.__version__  # Show the installed pandas version

In [None]:
obj=pd.Series([1,"John",3.5,"Hey"])
obj

Accessing Series elements

In [None]:
obj[0]

In [None]:
obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2

In [None]:
obj2["b"] 

In [None]:
obj2.index

Converting a dictionary to a Series

In [None]:
score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}
names=pd.Series(score) # Convert to Series 
names

In [None]:
names["Tim"] 

Searching with a condition

In [None]:
names[names>=85] 

In [None]:
names[names<=80]=83
names
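
Both cells above rely on a boolean mask: comparing a Series with a number produces a Series of True/False values, which can then be used to select or to assign. A minimal sketch, using a hypothetical `marks` Series with the same scores so that `names` is left untouched:

```python
import pandas as pd

marks = pd.Series({"Jane": 90, "Bill": 80, "Elon": 85, "Tom": 75, "Tim": 95})

# Comparing a Series with a scalar yields a boolean Series (the mask).
mask = marks >= 85
print(mask)

# Indexing with the mask keeps only the rows where the mask is True.
high = marks[mask]
print(high)

# Assigning through a mask changes only the matching rows.
adjusted = marks.copy()          # work on a copy so marks stays unchanged
adjusted[adjusted <= 80] = 83
print(adjusted)
```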

Checking whether a label is present in the Series
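
Note that the `in` operator on a Series looks at the index labels, not at the stored values. The sketch below illustrates the difference on a small, made-up `marks` Series; `isin` is shown as one way to test values instead.

```python
import pandas as pd

marks = pd.Series({"Jane": 90, "Bill": 80, "Elon": 85})

label_present = "Jane" in marks       # checks the index labels
value_as_label = 90 in marks          # 90 is a value, not a label
value_present = 90 in marks.values    # checks the underlying values
value_mask = marks.isin([80, 85])     # element-wise membership test

print(label_present, value_as_label, value_present)
print(value_mask)
```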

In [None]:
print("Tom" in names)  # True: "Tom" is an index label
print("Can" in names)  # False: "Can" is not an index label

In [None]:
names/10 

In [None]:
names**2

In [None]:
names.isnull() 
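
Since `names` has no missing values, `isnull()` returns False for every entry. To see it return True, here is a small sketch that builds a hypothetical Series with a label that has no matching value:

```python
import pandas as pd

marks = {"Jane": 90, "Bill": 80, "Elon": 85}

# "Can" has no entry in the dictionary, so pandas fills that slot with NaN.
with_gap = pd.Series(marks, index=["Jane", "Bill", "Can"])
print(with_gap)

missing = with_gap.isnull()   # True only where the value is missing
print(missing)
print(missing.sum())          # number of missing values
```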

<h1>Exercise 01</h1>
<p>Create a dictionary of your five best courses at Champlain College and their scores, then find the mean of the course marks.</p>

In [1]:
import pandas as pd

# Create a dictionary of courses and their scores
courses_scores = {
    "Big Data": 90,
    "Mobile Application iOS": 85,
    "Mobile Application Android": 92,
    "Final Project 1": 88,
    "English for Professionals": 94
}

# Convert the dictionary to a Pandas Series
courses_series = pd.Series(courses_scores)

# Calculate the mean of the scores
mean_score = courses_series.mean()

# Print the results
print("Courses and Scores:\n", courses_series)
print("\nMean of Scores:", mean_score)

Courses and Scores:
 Big Data                      90
Mobile Application iOS        85
Mobile Application Android    92
Final Project 1               88
English for Professionals     94
dtype: int64

Mean of Scores: 89.8


<h1>Exercise 02</h1>
<p>List the three best courses from the pandas Series.</p>

In [9]:
import pandas as pd

# Create a dictionary of courses and their scores
courses_scores = {
    "Big Data": 90,
    "Mobile Application iOS": 85,
    "Mobile Application Android": 92,
    "Final Project 1": 88,
    "English for Professionals": 94
}

# Convert the dictionary to a Pandas Series
courses_series = pd.Series(courses_scores)

# Sort the courses by their scores in descending order and select the top 3
best_three_courses = courses_series.sort_values(ascending=False).head(3)

# Print the results
print("Best Three Courses:\n", best_three_courses)

Best Three Courses:
 English for Professionals     94
Mobile Application Android    92
Big Data                      90
dtype: int64


## Read data from CSV

In [None]:
insurance = pd.read_csv("insurance.csv")

#### Read the first rows

In [None]:
insurance.head()

In [None]:
insurance.dtypes

In [None]:
insurance.region.describe()

#### Relative frequency of each value

In [None]:
insurance.region.value_counts(normalize=True)

#### Plotting the data

In [None]:
insurance.region.value_counts().plot(kind="bar")
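
To make the effect of `normalize=True` concrete without needing the CSV file, here is a minimal sketch on a made-up `regions` Series. It shows that `value_counts()` returns absolute counts while `normalize=True` returns each count as a fraction of the total.

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv.
regions = pd.Series(["north", "south", "south", "east", "south", "north"])

counts = regions.value_counts()                # absolute counts, largest first
shares = regions.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```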

### Exercise 03
Count the number of movies in the ratings.csv file. Also count the number of ratings given by each user. Plot both values as charts.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv("ratings.csv")

# Count the number of unique movies
num_movies = ratings["movieId"].nunique()

# Count the number of ratings per user
ratings_per_user = ratings["userId"].value_counts()

# Plot the number of movies
plt.figure(figsize=(10, 5))
plt.bar(["Movies"], [num_movies], color='blue')
plt.title("Number of Movies")
plt.ylabel("Count")
plt.show()

# Plot the number of ratings per user
ratings_per_user.plot(kind="bar", figsize=(15, 5), color='green', title="Number of Ratings per User")
plt.ylabel("Number of Ratings")
plt.xlabel("User ID")
plt.show()

# Print results
print("Number of Movies:", num_movies)
print("Number of Ratings per User (Top 5):")
print(ratings_per_user.head())


## DataFrame
A DataFrame is a pandas structure that holds the input data as a table of rows and columns. Each column is treated as a separate entity (a Series) and can have its own data type.
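
The idea that each column is its own entity can be sketched as follows; `table` is a hypothetical two-column example, separate from the `data` dictionary used below.

```python
import pandas as pd

table = pd.DataFrame({"name": ["Bill", "Tom"], "score": [90, 80]})

# Selecting one column returns a Series.
column = table["score"]
print(type(column))

# Each column carries its own dtype.
print(table.dtypes)
```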

In [1]:
import pandas as pd

data={"name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],
      "score":[90,80,85,75,95,60,65],
      "sport":["Wrestling","Football","Skiing","Swimming","Tennis",
               "Karate","Surfing"],
      "sex":["M","M","M","M","F","F","F"]}
df=pd.DataFrame(data)
df

### Selecting and ordering columns

In [None]:
df=pd.DataFrame(data,columns=["name","sport","sex","score"])
df

#### Read first entries

In [None]:
df.head()

#### Read last entries

In [None]:
df.tail() 

Last 3

In [None]:
df.tail(3)

First 2

In [None]:
df.head(2)

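#### Work on specific columns
A name listed in `columns` that is missing from the data, such as "age" here, becomes a column filled with NaN.

In [None]:
df=pd.DataFrame(data,columns=["name", "sport", "sex", "score", "age"])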
df
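
The `columns` argument both selects and orders the columns. A minimal sketch with a hypothetical `records` dictionary, showing what happens to a requested name that the data does not contain:

```python
import pandas as pd

records = {"name": ["Bill", "Tom"], "score": [90, 80]}

# "score" and "name" exist in the dictionary; "age" does not.
frame = pd.DataFrame(records, columns=["score", "name", "age"])
print(frame)

# The missing column is created and filled with NaN.
print(frame["age"].isnull().all())
```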

### Creating an index
An index gives each row a label that uniquely identifies it in the table.

In [None]:
df=pd.DataFrame(data,columns=["name", "sport", "sex", "score", "age"],
                index=["one","two","three","four","five","six","seven"])
df
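
As a small sketch of what the index provides, the example below builds a hypothetical three-row `frame`, inspects its labels, and also promotes an existing column to the index with `set_index`, which is one alternative to passing `index=` by hand.

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["Bill", "Tom", "Tim"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

print(frame.index)            # the row labels
print(frame.index.is_unique)  # labels should be unique to identify rows

# An existing column can also be used as the index.
by_name = frame.set_index("name")
print(by_name.loc["Tom", "score"])
```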

#### Access One Column

In [None]:
df["sport"]

#### Access Multiple Columns

In [None]:
my_columns=["name","sport"]
df[my_columns]

In [None]:
df.sport # same as df["sport"]

### Exercise 04
Add an index to the insurance DataFrame and print the DataFrame with its index. (You don't need to add the index values manually. You can generate them automatically.)

In [2]:
import pandas as pd

# Load the insurance dataset (file is in the same directory)
insurance = pd.read_csv("insurance.csv")

# Reset the index: the old index is kept as an "index" column and a fresh 0..n-1 index is generated
insurance_with_index = insurance.reset_index()

# Print the DataFrame with auto-generated indexes
print(insurance_with_index)

      index  age     sex     bmi  children smoker     region      charges
0         0   19  female  27.900         0    yes  southwest  16884.92400
1         1   18    male  33.770         1     no  southeast   1725.55230
2         2   28    male  33.000         3     no  southeast   4449.46200
3         3   33    male  22.705         0     no  northwest  21984.47061
4         4   32    male  28.880         0     no  northwest   3866.85520
...     ...  ...     ...     ...       ...    ...        ...          ...
1333   1333   50    male  30.970         3     no  northwest  10600.54830
1334   1334   18  female  31.920         0     no  northeast   2205.98080
1335   1335   18  female  36.850         0     no  southeast   1629.83350
1336   1336   21  female  25.800         0     no  southwest   2007.94500
1337   1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 8 columns]
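
The extra `index` column in the output above comes from `reset_index()` keeping the old index as a regular column. The sketch below, on a hypothetical three-row `frame`, contrasts the default with `drop=True`, which discards the old index instead.

```python
import pandas as pd

# Toy frame with a non-default index, for illustration only.
frame = pd.DataFrame({"age": [19, 18, 28]}, index=[10, 20, 30])

kept = frame.reset_index()              # old index becomes a column named "index"
dropped = frame.reset_index(drop=True)  # old index is discarded

print(kept)
print(dropped)
```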


### Exercise 05
Read two columns, age and charges, from the insurance DataFrame.


In [2]:
import pandas as pd

# Load the insurance dataset
insurance = pd.read_csv("insurance.csv")

# Select and read the 'age' and 'charges' columns
age_charges = insurance[['age', 'charges']]

# Display the first few rows of the selected columns
print(age_charges.head())

   age      charges
0   19  16884.92400
1   18   1725.55230
2   28   4449.46200
3   33  21984.47061
4   32   3866.85520


### Loc method
The `loc` method selects rows (and optionally columns) of a DataFrame by their index labels rather than by their position.

In [None]:
df.loc[["one"]]

Selecting multiple rows by label

In [None]:
df.loc[["one","two"]]
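
The shape of what `loc` returns depends on how the labels are passed. A minimal sketch on a hypothetical `frame` (the variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["Bill", "Tom", "Tim"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

row_as_series = frame.loc["one"]           # single label -> Series
row_as_frame = frame.loc[["one"]]          # list of labels -> DataFrame
cell = frame.loc["two", "score"]           # row label and column label -> scalar
block = frame.loc["one":"two", ["name"]]   # label slices include both endpoints

print(row_as_series)
print(row_as_frame)
print(cell)
print(block)
```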

### Updating the dataframe

In [None]:
df=pd.DataFrame(data,columns=["name", "sport", "sex", "score", "age"],
                index=["one","two","three","four","five","six","seven"])
values=[18,19,20,18,17,17,18]
df["age"]=values
df

#### Update with condition

In [None]:
df["pass"]=df.score>=70
df
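
A condition can also drive updates to only some rows. The sketch below uses a hypothetical `frame` and a made-up `grade` column: it first sets a default value for every row and then overwrites the rows that satisfy the condition through `loc`.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Bill", "Vanessa", "Kate"], "score": [90, 60, 65]})

# Start with a default for every row.
frame["grade"] = "fail"

# Overwrite only the rows where the condition holds.
frame.loc[frame["score"] >= 70, "grade"] = "pass"

print(frame)
```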

#### Exercise 6
Update the insurance DataFrame with a new column that is True only for people whose charges are more than 20000.

In [3]:
import pandas as pd

# Load the insurance dataset
insurance = pd.read_csv("insurance.csv")

# Add a new column 'high_charges' with a condition: charges > 20000
insurance['high_charges'] = insurance['charges'] > 20000

# Display the first few rows of the updated DataFrame
print(insurance.head())

   age     sex     bmi  children smoker     region      charges  high_charges
0   19  female  27.900         0    yes  southwest  16884.92400         False
1   18    male  33.770         1     no  southeast   1725.55230         False
2   28    male  33.000         3     no  southeast   4449.46200         False
3   33    male  22.705         0     no  northwest  21984.47061          True
4   32    male  28.880         0     no  northwest   3866.85520         False


### Delete a column

In [None]:
del df["pass"]
df

### Transpose the dataframe

In [None]:
scores={"Math":{"A":85,"B":90,"C":95}, "Physics":{"A":90,"B":80,"C":75}}
scores_df=pd.DataFrame(scores)
scores_df

In [None]:
scores_df.T

In [None]:
scores_df.index.name="name"
scores_df.columns.name="lesson"
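
Transposing swaps the roles of the index and the columns without changing the values. A small sketch on a hypothetical `marks` frame:

```python
import pandas as pd

marks = pd.DataFrame({"Math": {"A": 85, "B": 90}, "Physics": {"A": 90, "B": 80}})
flipped = marks.T

print(marks)
print(flipped)

# The value for student "B" in "Math" is the same cell in both layouts.
print(marks.loc["B", "Math"], flipped.loc["Math", "B"])
```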

### Selecting with iloc
`iloc` selects rows and columns by their integer position (counting from 0) rather than by their labels.

In [None]:
df.iloc[1]
df.iloc[1,[1,2,3]]
df.iloc[[1,3],[1,2,3]]
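
To make the contrast with `loc` explicit, here is a minimal sketch on a hypothetical `frame`. It also shows one difference that is easy to miss: positional slices exclude the end position, while label slices include the end label.

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["Bill", "Tom", "Tim"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

second_row = frame.iloc[1]     # position 1 is the row labelled "two"
cell = frame.iloc[1, 1]        # row position 1, column position 1
first_two = frame.iloc[:2]     # positions 0 and 1 (end excluded)
by_label = frame.loc[:"two"]   # labels up to and including "two"

print(second_row)
print(cell)
print(first_two.equals(by_label))
```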

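## DataFrame Indexing

In [None]:
import numpy as np

data=pd.DataFrame(
    np.arange(16).reshape(4,4),
    index=["London","Paris",
           "Berlin","Istanbul"],
    columns=["one","two","three","four"])
data

In [None]:
data["two"]            # one column
data[["one","two"]]    # several columns
data[:3]               # first three rows
data[data["four"]>5]   # rows where "four" is greater than 5

In [None]:
data[data<5]=0         # set every value below 5 to 0
data

## Selecting with iloc and loc

In [None]:
data.iloc[1]
data.iloc[1,[1,2,3]]
data.iloc[[1,3],[1,2,3]]
data.loc["Paris",["one","two"]]
data.loc[:"Paris","four"]

In [None]:
toy_data=pd.Series(np.arange(5),
                   index=["a","b","c",
                          "d","e"])
toy_data

In [None]:
toy_data.iloc[-1]      # last element by position; explicit iloc avoids treating -1 as a label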

In [1]:
x = lambda a, b: a % b
print(x(5,6))

5


### Exercise 7
Print the 18th record from the insurance table. Print the age and charges of that record using iloc.

In [13]:
import pandas as pd

# Load the insurance dataset
insurance = pd.read_csv("insurance.csv")

# Print the 18th record
record_18 = insurance.iloc[17]  # Index 17 corresponds to the 18th record
print("18th Record:")
print(record_18)

# Print the age and charges of the 18th record
age_charges = insurance.iloc[17, [insurance.columns.get_loc("age"), insurance.columns.get_loc("charges")]]
print("\nAge and Charges of the 18th Record:")
print(age_charges)


18th Record:
age                 23
sex               male
bmi             23.845
children             0
smoker              no
region       northeast
charges     2395.17155
Name: 17, dtype: object

Age and Charges of the 18th Record:
age                23
charges    2395.17155
Name: 17, dtype: object