# Pandas



## pandas
Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.


In [None]:
import pandas as pd

## Reading CSV files

The California housing dataset is a collection of housing data for districts in the state of California. It is often used in machine learning courses and projects.

The full dataset contains **20,640 observations** on **9 variables** that could influence housing prices in different districts of California. The columns used in this notebook include `median_income`, `housing_median_age`, `total_rooms`, `total_bedrooms`, `households`, and `latitude`. The `train` and `test` files loaded below each hold only part of the data, so expect fewer rows; `train.shape` reports the actual size.

A common use of this dataset is predicting median house values, which makes it a good starting point for exploring the California housing market or building predictive models for real estate prices. You can find the dataset on platforms like GitHub and Kaggle.

In [None]:
train = pd.read_csv('/content/sample_data/california_housing_train.csv')
test = pd.read_csv('/content/sample_data/california_housing_test.csv')
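
The paths above point to a `/content/sample_data` folder, which will not exist in every environment. As a minimal sketch of what `pd.read_csv` does, the example below parses a tiny CSV held in memory instead of on disk. The two rows are invented for illustration and are not taken from the real dataset; only the column names match ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the values are made up for illustration only.
csv_text = """median_income,total_rooms,housing_median_age
3.5,1200,15
5.1,3400,30
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred from the text
```

The first line of the text is treated as the header row, and each column's type is inferred from its values.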

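Let's have a look at our datasets. `head()` shows the first five rows, `tail()` shows the last five, and `shape` gives the number of rows and columns.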

In [None]:
train.head()

In [None]:
test.tail()

In [None]:
train.shape

## Creating a DataFrame

In [None]:
Kids_data = pd.DataFrame({'Name': ["Sarah", "Amir", "Ali"], 'Age': [20, 22, 33], 'score':[16,12,14]})
Kids_data

## Index

In [None]:
Kids_data2 = Kids_data.set_index('Name')
Kids_data2

In [None]:
Kids_data3 = pd.DataFrame({'Age': [20, 22, 33], 'score':[16,12,14]},index = ["Sarah", "Amir", "Ali"])
Kids_data3

## Renaming columns

In [None]:
Kids_data2.rename(columns = {'score': 'math_score'}, inplace = True)
Kids_data2

## Dropping/Adding columns and rows

In [None]:
Kids_data2.drop(columns = 'math_score')

In [None]:
Kids_data2.drop("Ali")
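
Note that `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The sketch below illustrates this on a small stand-in table; the names `kids`, `without_ali`, and `without_score` are hypothetical and only mirror the data used above.

```python
import pandas as pd

# Stand-in for the table above, built directly with names as the index.
kids = pd.DataFrame(
    {'Age': [20, 22, 33], 'math_score': [16, 12, 14]},
    index=['Sarah', 'Amir', 'Ali'],
)

without_ali = kids.drop('Ali')                   # drop a row by its index label
without_score = kids.drop(columns='math_score')  # drop a column by its name

print(kids.shape)                      # the original still has every row and column
print(without_ali.index.tolist())
print(without_score.columns.tolist())
```

To keep the change, assign the result to a variable, for example `kids = kids.drop('Ali')`.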

In [None]:
Kids_data2['Salary'] = [10000, 200000, 99000]
Kids_data2

## Series

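A Series is a one-dimensional, labeled sequence of data values. If a DataFrame is a table, a Series is a single column of it. It resembles a Python list, but every value also carries an index label.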

In [None]:
pd.Series([1, 2, 3, 4, 5])
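
To show how a Series differs from a plain list, here is a small sketch with a custom index. The names `ages` and `df` are hypothetical and reuse the sample values from earlier in this notebook.

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, 2, ... index.
ages = pd.Series([20, 22, 33], index=['Sarah', 'Amir', 'Ali'], name='Age')

print(ages['Amir'])  # look up a value by its label
print(ages.iloc[0])  # look up a value by its position

# Selecting one column of a DataFrame gives back a Series.
df = pd.DataFrame({'Age': [20, 22, 33]}, index=['Sarah', 'Amir', 'Ali'])
print(type(df['Age']))
```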

In [None]:
pd.DataFrame([[99102345, 'Ali', 22], [98102345, 'Kamyar', 22], [98102777, 'Pardis', 21]], columns = ['Student_ID', 'Name', 'Age'])

## Selection

In [None]:
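# Both lines select the same column; bracket access also works for
# column names that contain spaces or clash with DataFrame attributes.
print(train.median_income)
print(train["median_income"])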

We use `iloc` to select data by integer position in the DataFrame: the first argument picks rows and the second picks columns.



In [None]:
train.iloc[0, :]

In [None]:
train.iloc[[0, 1, 2], :]
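
Position-based selection follows Python's slicing rules, so the end of a slice is excluded. The sketch below uses a small made-up DataFrame (`df`, with hypothetical columns `a` and `b`) to show that `iloc` ignores the index labels entirely.

```python
import pandas as pd

# Made-up data with non-numeric index labels.
df = pd.DataFrame(
    {'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]},
    index=['w', 'x', 'y', 'z'],
)

first_two = df.iloc[0:2, :]  # positions 0 and 1; position 2 is excluded
last_row = df.iloc[-1, :]    # negative positions count from the end
cell = df.iloc[1, 0]         # second row, first column

print(first_two)
print(last_row)
print(cell)
```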

With `loc` we select by label: we pass the index label of the row and the actual name of the column. `loc` also accepts a boolean condition to filter rows.



In [None]:
train.loc[0, 'total_rooms']

In [None]:
train.loc[train['housing_median_age'] > 16, :]


In [None]:
train.loc[(train['latitude'] < 40) | (train['total_rooms'] >= 1000), :]
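
The difference between `|` (or) and `&` (and) in such filters is easy to miss, so here is a minimal sketch on four made-up rows. The values are hypothetical; only the column names come from the dataset. It also shows that label slices with `loc` include the end label, unlike `iloc`.

```python
import pandas as pd

# Four invented rows, for illustration only.
df = pd.DataFrame({
    'latitude': [34.0, 41.5, 38.2, 42.0],
    'total_rooms': [500, 1500, 2000, 800],
})

# Each condition needs its own parentheses because | and & bind tightly.
either = df.loc[(df['latitude'] < 40) | (df['total_rooms'] >= 1000), :]
both = df.loc[(df['latitude'] < 40) & (df['total_rooms'] >= 1000), :]

# Label-based slices include the end label.
inclusive = df.loc[0:2, 'total_rooms']

print(either.index.tolist())
print(both.index.tolist())
print(len(inclusive))
```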


## Summary functions

In [None]:
train['total_bedrooms'].describe()

In [None]:
train['total_bedrooms'].value_counts()

In [None]:
train['total_bedrooms'].min()

In [None]:
train.groupby('households')['total_bedrooms'].sum().sort_values(ascending = False)


In [None]:
train.isnull().sum()

In [None]:
train.dropna()
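
Like `drop`, `dropna` returns a new DataFrame rather than changing `train`. The sketch below shows this on a small made-up table and, as one possible alternative to discarding rows, fills the gap with the column median. The data and the choice of the median are assumptions for illustration, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value.
df = pd.DataFrame({
    'total_bedrooms': [100.0, np.nan, 300.0],
    'households': [50, 60, 70],
})

missing_per_column = df.isnull().sum()  # count of missing values per column
cleaned = df.dropna()                   # rows with any missing value are removed

# Alternative: keep every row and fill the gap with the column median.
filled = df.fillna({'total_bedrooms': df['total_bedrooms'].median()})

print(missing_per_column)
print(cleaned.shape)
print(filled['total_bedrooms'].tolist())
```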

## Visualization

In [None]:
train