In [62]:
# !pip install pandas

Pandas is a powerful and popular Python library for data manipulation and analysis. It provides two primary data structures: **Series** (one-dimensional) and **DataFrame** (two-dimensional). These structures are highly flexible, allowing for easy data handling, cleaning, and transformation. Pandas supports operations like filtering, grouping, merging, reshaping, and statistical analysis. It integrates seamlessly with other data science libraries such as NumPy and Matplotlib. Pandas also offers robust input/output capabilities, enabling data import from and export to various formats, including CSV, Excel, SQL databases, and JSON. Its intuitive syntax and rich functionality make it essential for data analysis in Python.

In [63]:
import pandas as pd
import numpy as np

Why do we use Pandas?
Pandas is a powerful and popular library in Python used for data manipulation and analysis. Here are some key reasons why it's so widely used:

1. **Data Structures**: Pandas provides two primary data structures—Series (1-dimensional) and DataFrame (2-dimensional)—that are very flexible and efficient for handling various types of data.

2. **Data Cleaning**: It offers extensive tools for cleaning and preprocessing data, such as handling missing values, filtering, and transforming data.

3. **Data Manipulation**: Pandas makes it easy to manipulate data by providing functions to sort, merge, reshape, and aggregate datasets.

4. **Data Analysis**: It supports powerful data analysis operations, including grouping, pivoting, and applying statistical functions.

5. **Integration**: Pandas integrates well with other libraries and tools used for data analysis and machine learning, such as NumPy, SciPy, and scikit-learn.

6. **I/O Operations**: It provides robust methods to read from and write to various file formats, including CSV, Excel, SQL databases, and more.

7. **Performance**: Pandas builds on NumPy, so vectorized operations on data that fits in memory run efficiently, although specialized libraries can be faster for very large datasets.

Overall, Pandas simplifies many aspects of working with data, making it a go-to tool for data analysts and scientists.

# What is a DataFrame?
Here are the key aspects of DataFrames explained in six points:

1. **Two-Dimensional Structure**:
   - A DataFrame is a table-like structure with rows and columns, similar to a spreadsheet or SQL table.

2. **Labeled Axes**:
   - Each row and column in a DataFrame has labels, which makes data manipulation and access more intuitive.

3. **Heterogeneous Data**:
   - Columns in a DataFrame can hold different data types (integers, floats, strings, etc.), allowing for versatile data representation.

4. **Size-Mutable**:
   - DataFrames are dynamic; you can add or remove rows and columns as needed.

5. **Data Alignment**:
   - DataFrame operations automatically align data based on row and column labels, simplifying data merging and arithmetic operations.

6. **Rich Functionality**:
   - DataFrames provide extensive built-in functions for data analysis and manipulation, such as filtering, grouping, and statistical computations.
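
A minimal sketch of several of these properties, using a small made-up table (the names `df_demo`, `s1`, `s2` and all values are illustrative assumptions, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30]},
    index=["r1", "r2"],  # labeled rows
)

# Heterogeneous data: each column keeps its own dtype
print(df_demo.dtypes)

# Size-mutable: add a column, then remove it again
df_demo["Score"] = [88.5, 92.0]
df_demo = df_demo.drop(columns="Score")

# Data alignment: arithmetic matches values by label, not by position
s1 = pd.Series([1, 2], index=["r1", "r2"])
s2 = pd.Series([10, 20], index=["r2", "r1"])
total = s1 + s2
print(total)
```

Even though `s2` lists its labels in the opposite order, `r1` is added to `r1` and `r2` to `r2`.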

# How can we create a DataFrame?
A DataFrame can be created in several ways. The most common methods are:

### 1. Creating a DataFrame from a Dictionary

A dictionary can be used where each key represents a column name and its associated value is a list of column values.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
```

### 2. Creating a DataFrame from a List of Lists

Lists of lists can be converted to a DataFrame by specifying the column names.

```python
import pandas as pd

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
```

### 3. Creating a DataFrame from a List of Dictionaries

A list where each element is a dictionary representing a row.

```python
import pandas as pd

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

df = pd.DataFrame(data)
print(df)
```

### 4. Creating a DataFrame from a CSV File

Read data from a CSV file into a DataFrame.

```python
import pandas as pd

df = pd.read_csv('data.csv')
print(df)
```

### 5. Creating a DataFrame from an Excel File

Read data from an Excel file into a DataFrame.

```python
import pandas as pd

df = pd.read_excel('data.xlsx')
print(df)
```

### 6. Creating an Empty DataFrame and Adding Data Later

Initialize an empty DataFrame and add rows to it later.

```python
import pandas as pd

df = pd.DataFrame(columns=['Name', 'Age', 'City'])
df.loc[0] = ['Alice', 25, 'New York']
df.loc[1] = ['Bob', 30, 'Los Angeles']
df.loc[2] = ['Charlie', 35, 'Chicago']
print(df)
```

### 7. Creating a DataFrame from a NumPy Array

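Convert a NumPy array to a DataFrame by specifying column names. A NumPy array holds a single data type, so mixing strings and numbers as below stores every value, including `Age`, as a string.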

```python
import pandas as pd
import numpy as np

data = np.array([['Alice', 25, 'New York'],
                 ['Bob', 30, 'Los Angeles'],
                 ['Charlie', 35, 'Chicago']])

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
```

Each method allows for flexibility in creating DataFrames from various data structures, making Pandas a versatile tool for data manipulation and analysis.
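
As a quick check, the sketch below builds the same small table in three of the ways listed above and compares the results. It also shows what happens to numbers that pass through a mixed NumPy array. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

columns = ["Name", "Age", "City"]
rows = [["Alice", 25, "New York"], ["Bob", 30, "Los Angeles"]]

from_dict = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "City": ["New York", "Los Angeles"],
})
from_lists = pd.DataFrame(rows, columns=columns)
from_records = pd.DataFrame([dict(zip(columns, row)) for row in rows])

# The three constructions produce identical DataFrames
print(from_dict.equals(from_lists))
print(from_dict.equals(from_records))

# A mixed NumPy array stores everything as strings, so Age is no longer numeric
from_array = pd.DataFrame(np.array(rows), columns=columns)
print(from_array["Age"].tolist())
```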

In [64]:
dt = {"Name":['Ritu','Taniya','abhishek','tanmay','mayank'],
     "Branch":['it','cse','it','ece','cse'],
     "Class":[2,3,4,1,2], 
     "City":['jaipur','agra','vridavan','mathura','gokul']
     }
dt

{'Name': ['Ritu', 'Taniya', 'abhishek', 'tanmay', 'mayank'],
 'Branch': ['it', 'cse', 'it', 'ece', 'cse'],
 'Class': [2, 3, 4, 1, 2],
 'City': ['jaipur', 'agra', 'vridavan', 'mathura', 'gokul']}

In [65]:
# Similar in spirit to np.array(): build a DataFrame from the dictionary
df = pd.DataFrame(dt)
df

Unnamed: 0,Name,Branch,Class,City
0,Ritu,it,2,jaipur
1,Taniya,cse,3,agra
2,abhishek,it,4,vridavan
3,tanmay,ece,1,mathura
4,mayank,cse,2,gokul


In [66]:
type(df)

pandas.core.frame.DataFrame

In [67]:
# A Series is a single column of a DataFrame
df["Name"]
 

0        Ritu
1      Taniya
2    abhishek
3      tanmay
4      mayank
Name: Name, dtype: object

* What is dtype?
`dtype` is short for data type: the type of the values stored in a Series or column.
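
A small sketch of inspecting and converting a dtype, using made-up values (the names `classes` and `as_float` are illustrative):

```python
import pandas as pd

# Hypothetical values, for illustration only
classes = pd.Series([2, 3, 4, 1, 2], name="Class")
print(classes.dtype)  # an integer dtype

# astype returns a converted copy; the original Series is unchanged
as_float = classes.astype("float64")
print(as_float.dtype)
```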

In [68]:
type(df['Name'])

pandas.core.series.Series

In [69]:
df.dtypes

Name      object
Branch    object
Class      int64
City      object
dtype: object

* The `object` dtype usually means the column holds strings, although it can hold any Python object.

In [70]:
df.shape
# 5 => rows
# 4 => columns (one per dictionary key)

(5, 4)

In [71]:
df.Name  # attribute access: shorthand for df["Name"]

0        Ritu
1      Taniya
2    abhishek
3      tanmay
4      mayank
Name: Name, dtype: object

In [72]:
df[["Name","Branch","City"]]

Unnamed: 0,Name,Branch,City
0,Ritu,it,jaipur
1,Taniya,cse,agra
2,abhishek,it,vridavan
3,tanmay,ece,mathura
4,mayank,cse,gokul


In [73]:
df[["Branch","Class"]]

Unnamed: 0,Branch,Class
0,it,2
1,cse,3
2,it,4
3,ece,1
4,cse,2


In [74]:
# df[1] raises a KeyError because 1 is not a column label

In [75]:
df[["Name","Branch"]][1:3]

Unnamed: 0,Name,Branch
1,Taniya,cse
2,abhishek,it


In [76]:
df.loc[1:2,["Name","Branch"]]  # with loc, the stop label is inclusive

Unnamed: 0,Name,Branch
1,Taniya,cse
2,abhishek,it


The expression `df.loc[1:2, ["Name", "Branch"]]` in Pandas is used to select a subset of data from a DataFrame using label-based indexing. Here’s a brief explanation:

- **`df`**: This is your DataFrame.
- **`loc`**: This is the label-based indexing method provided by Pandas.
- **`1:2`**: This specifies the rows to be selected. It includes rows with labels 1 and 2.
- **`["Name", "Branch"]`**: This specifies the columns to be selected by their names.

### Breakdown:
1. **Row Selection (`1:2`)**: 
   - This selects rows with labels from 1 to 2, inclusive. If the DataFrame index is numeric and ordered, this would typically be the second and third rows (since indexing starts at 0).

2. **Column Selection (`["Name", "Branch"]`)**:
   - This selects the columns named "Name" and "Branch".

### Example:

Assume you have the following DataFrame:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Branch': ['CS', 'EE', 'ME']
}

df = pd.DataFrame(data)
print(df)
```

Output:
```
       Name  Age Branch
0     Alice   25     CS
1       Bob   30     EE
2   Charlie   35     ME
```

Using the expression:

```python
subset = df.loc[1:2, ["Name", "Branch"]]
print(subset)
```

Output:
```
       Name Branch
1       Bob     EE
2   Charlie     ME
```

### Summary:

- **`df.loc[1:2, ["Name", "Branch"]]`**: Selects rows 1 and 2, and only the "Name" and "Branch" columns from those rows.
- **Result**: A new DataFrame containing only the specified rows and columns.

This operation is useful for extracting specific parts of your data for further analysis or processing.
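
Because `loc` works on labels, it behaves the same way when the index is not numeric. A minimal sketch with made-up string labels (the frame `marks` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with string row labels, for illustration only
marks = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Branch": ["CS", "EE", "ME", "CE"]},
    index=["a", "b", "c", "d"],
)

# Label slice: both "b" and "c" are included
subset = marks.loc["b":"c", ["Name"]]
print(subset)

# A boolean condition also works as the row selector
cs_only = marks.loc[marks["Branch"] == "CS", "Name"]
print(cs_only)
```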

In [77]:
df.iloc[1:3,0:2]  # with iloc, the stop position is exclusive

Unnamed: 0,Name,Branch
1,Taniya,cse
2,abhishek,it


The expression `df.iloc[1:3, 0:2]` in Pandas is used to select a subset of data from a DataFrame using integer-based indexing. Here’s a brief explanation:

- **`df`**: This is your DataFrame.
- **`iloc`**: This method allows for integer-location-based indexing for selection by position.
- **`1:3`**: This specifies the rows to be selected, starting from row index 1 up to (but not including) row index 3. It includes the second and third rows.
- **`0:2`**: This specifies the columns to be selected, starting from column index 0 up to (but not including) column index 2. It includes the first and second columns.

### Example:

Assume you have the following DataFrame:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Branch': ['CS', 'EE', 'ME']
}

df = pd.DataFrame(data)
print(df)
```

Output:
```
       Name  Age Branch
0     Alice   25     CS
1       Bob   30     EE
2   Charlie   35     ME
```

Using the expression:

```python
subset = df.iloc[1:3, 0:2]
print(subset)
```

Output:
```
       Name  Age
1       Bob   30
2   Charlie   35
```

### Summary:
- **`df.iloc[1:3, 0:2]`**: Selects rows 1 and 2, and columns 0 and 1 from those rows.
- **Result**: A new DataFrame containing the specified rows and columns based on their integer positions.
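
The difference between positions and labels is easiest to see when the index does not start at 0. In this sketch (made-up data, illustrative names) the same slice `1:2` selects different rows with `iloc` and `loc`:

```python
import pandas as pd

# Hypothetical data whose integer labels do not match the row positions
df_pos = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]}, index=[1, 2, 3])

by_position = df_pos.iloc[1:2]  # position 1 only; the stop position is excluded
by_label = df_pos.loc[1:2]      # labels 1 and 2; the stop label is included

print(by_position)
print(by_label)
```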

In [78]:
df.iloc[3:5,1:4]

Unnamed: 0,Branch,Class,City
3,ece,1,mathura
4,cse,2,gokul


In [79]:
df.loc[3:5,["Branch","Class","City"]]

Unnamed: 0,Branch,Class,City
3,ece,1,mathura
4,cse,2,gokul


In [80]:
df=pd.read_csv("Used_Bikes.csv")
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
32643,Hero Passion Pro 100cc,39000.0,Delhi,22000.0,First Owner,4.0,100.0,Hero
32644,TVS Apache RTR 180cc,30000.0,Karnal,6639.0,First Owner,9.0,180.0,TVS
32645,Bajaj Avenger Street 220,60000.0,Delhi,20373.0,First Owner,6.0,220.0,Bajaj
32646,Hero Super Splendor 125cc,15600.0,Jaipur,84186.0,First Owner,16.0,125.0,Hero


In [81]:
df.head()  # shows the first 5 rows by default
df.head(10)  # only the last expression in a cell is displayed

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
5,Yamaha FZs 150cc,53499.0,Delhi,25000.0,First Owner,6.0,150.0,Yamaha
6,Honda CB Hornet 160R ABS DLX,85000.0,Delhi,8200.0,First Owner,3.0,160.0,Honda
7,Hero Splendor Plus Self Alloy 100cc,45000.0,Delhi,12645.0,First Owner,3.0,100.0,Hero
8,Royal Enfield Thunderbird X 350cc,145000.0,Bangalore,9190.0,First Owner,3.0,350.0,Royal Enfield
9,Royal Enfield Classic Desert Storm 500cc,88000.0,Delhi,19000.0,Second Owner,7.0,500.0,Royal Enfield


In [82]:
# df.tail() shows the last 5 rows by default
df.tail(19)

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
32629,Bajaj Platina 100cc,23000.0,Lucknow,20000.0,First Owner,6.0,100.0,Bajaj
32630,Suzuki Slingshot Plus 125cc,32000.0,Lucknow,22697.0,First Owner,7.0,125.0,Suzuki
32631,Yamaha SZ-RR 150cc,20000.0,Kanchipuram,52000.0,First Owner,10.0,150.0,Yamaha
32632,Bajaj Avenger Street 220,55005.0,Godhara,6600.0,First Owner,5.0,220.0,Bajaj
32633,Royal Enfield Classic 350cc,87000.0,Gautam Buddha Nagar,16336.0,First Owner,7.0,350.0,Royal Enfield
32634,Royal Enfield Thunderbird 350cc,70000.0,Mumbai,13858.0,Second Owner,11.0,350.0,Royal Enfield
32635,Suzuki Zeus 125cc,35000.0,Ahmedabad,11885.0,First Owner,12.0,125.0,Suzuki
32636,KTM RC 390cc,196700.0,Mumbai,13216.0,First Owner,4.0,390.0,KTM
32637,Bajaj Pulsar 150cc,25000.0,Delhi,32588.0,First Owner,9.0,150.0,Bajaj
32638,Yamaha Fazer 25 250cc,123000.0,Kadapa,14500.0,First Owner,4.0,250.0,Yamaha


What is a filtering operation in Pandas?

Filtering in Pandas refers to the process of selecting a subset of rows from a DataFrame based on certain conditions. This is typically done using boolean indexing, where a condition or set of conditions is applied to the DataFrame to create a boolean mask. The mask is then used to filter the DataFrame, returning only the rows that meet the specified criteria.

### Example of Filtering Operations in Pandas

#### 1. Filtering Based on a Single Condition

To filter rows where a column meets a specific condition:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
```

Output:
```
      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston
```

#### 2. Filtering Based on Multiple Conditions

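You can combine multiple conditions using the `&` (and) and `|` (or) operators. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`: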

```python
# Filter rows where Age is greater than 30 and City is 'Chicago'
filtered_df = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(filtered_df)
```

Output:
```
      Name  Age     City
2  Charlie   35  Chicago
```

#### 3. Filtering Based on String Conditions

You can filter based on string conditions, such as checking if a column contains a specific substring:

```python
# Filter rows where City contains 'New'
filtered_df = df[df['City'].str.contains('New')]
print(filtered_df)
```

Output:
```
    Name  Age      City
0  Alice   25  New York
```

#### 4. Filtering Using the `query` Method

The `query` method provides a way to filter DataFrames using a query string:

```python
# Filter rows using a query string
filtered_df = df.query('Age > 30 and City == "Chicago"')
print(filtered_df)
```

Output:
```
      Name  Age     City
2  Charlie   35  Chicago
```

#### 5. Filtering with the `isin` Method

To filter rows where a column's value is in a list of values:

```python
# Filter rows where City is either 'New York' or 'Chicago'
filtered_df = df[df['City'].isin(['New York', 'Chicago'])]
print(filtered_df)
```

Output:
```
      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago
```

### Summary

- **Single Condition**: Filter rows based on one condition (e.g., `df[df['Age'] > 30]`).
- **Multiple Conditions**: Combine conditions using `&` and `|` (e.g., `df[(df['Age'] > 30) & (df['City'] == 'Chicago')]`).
- **String Conditions**: Use string methods for filtering (e.g., `df[df['City'].str.contains('New')]`).
- **`query` Method**: Use query strings to filter (e.g., `df.query('Age > 30 and City == "Chicago"')`).
- **`isin` Method**: Filter rows where a column's value is in a list (e.g., `df[df['City'].isin(['New York', 'Chicago'])]`).

Filtering is a powerful tool in Pandas that allows for efficient data manipulation and analysis by extracting relevant subsets of data based on specified criteria.
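
Two further patterns that fit the same boolean-mask idea are negating a mask with `~` and testing a range with `between`. A minimal sketch on made-up data (the frame `people` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# A condition produces a boolean Series with one value per row
mask = people["Age"] > 30
print(mask.tolist())

# ~ negates a mask: every row whose City is not Chicago
not_chicago = people[~(people["City"] == "Chicago")]

# between includes both bounds by default
age_30_to_35 = people[people["Age"].between(30, 35)]

print(not_chicago)
print(age_30_to_35)
```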



In [83]:
df['brand'].nunique()


23

In [84]:
df['brand'].unique()


array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [85]:
df['brand'].value_counts()  # number of listings for each brand

brand
Bajaj              11213
Hero                6368
Royal Enfield       4178
Yamaha              3916
Honda               2108
Suzuki              1464
TVS                 1247
KTM                 1077
Harley-Davidson      737
Kawasaki              79
Hyosung               64
Benelli               56
Mahindra              55
Triumph               26
Ducati                22
BMW                   16
Jawa                  10
MV                     4
Indian                 3
Ideal                  2
Rajdoot                1
Yezdi                  1
LML                    1
Name: count, dtype: int64

In [86]:
bullet=df[df['brand'] == "Royal Enfield"]
bullet.head()

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
8,Royal Enfield Thunderbird X 350cc,145000.0,Bangalore,9190.0,First Owner,3.0,350.0,Royal Enfield
9,Royal Enfield Classic Desert Storm 500cc,88000.0,Delhi,19000.0,Second Owner,7.0,500.0,Royal Enfield
23,Royal Enfield Classic Chrome 500cc,121700.0,Kalyan,24520.0,First Owner,5.0,500.0,Royal Enfield
36,Royal Enfield Classic 350cc,98800.0,Kochi,39000.0,First Owner,5.0,350.0,Royal Enfield


In [87]:
bullet.shape

(4178, 8)

In [88]:
# brand == Royal Enfield
# age <= 2 years
# owner == First Owner

bullet = df[(df['brand'] == "Royal Enfield") & (df['age'] <= 2) & (df['owner'] == "First Owner")]
bullet.head()


Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
38,Royal Enfield Thunderbird X 500cc,190500.0,Samastipur,4550.0,First Owner,2.0,500.0,Royal Enfield
81,Royal Enfield Interceptor 650cc,260000.0,Navi Mumbai,3800.0,First Owner,2.0,650.0,Royal Enfield
139,Royal Enfield Himalayan 410cc Fi ABS,173300.0,Vadodara,14000.0,First Owner,2.0,410.0,Royal Enfield
157,Royal Enfield Himalayan 410cc Fi ABS,173300.0,Vadodara,14000.0,First Owner,2.0,410.0,Royal Enfield
194,Royal Enfield Electra 350cc,145000.0,Bangalore,4000.0,First Owner,2.0,350.0,Royal Enfield


In [89]:
bullet.shape

(84, 8)

In [90]:
bullet["city"].unique()

array(['Samastipur', 'Navi Mumbai', 'Vadodara', 'Bangalore',
       'Hamirpur(hp)', 'Mumbai', 'Delhi', 'Guwahati', 'Haldwani',
       'Ahmedabad', 'Bardhaman', 'Silchar', 'Sibsagar', 'Kharar',
       'Baripara', 'Sonipat', 'Pune', 'Farukhabad', 'Sultanpur',
       'Hyderabad', 'Gurgaon', 'Faridabad', 'Kota', 'Thane', 'Nellore',
       'Alipore', 'Ghaziabad', 'Noida'], dtype=object)

In [91]:
bullet["city"].nunique()

28

In [92]:
bullet["city"].value_counts()

city
Delhi           18
Mumbai          10
Bangalore        6
Ahmedabad        6
Vadodara         4
Guwahati         4
Faridabad        4
Hyderabad        3
Ghaziabad        3
Sibsagar         2
Bardhaman        2
Navi Mumbai      2
Gurgaon          2
Alipore          2
Farukhabad       2
Kharar           2
Silchar          1
Haldwani         1
Hamirpur(hp)     1
Samastipur       1
Sultanpur        1
Pune             1
Baripara         1
Sonipat          1
Thane            1
Kota             1
Nellore          1
Noida            1
Name: count, dtype: int64

In [93]:
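# Note: the comparison is case-sensitive and the dataset spells the city "Jaipur", so "jaipur" matches no rows
bullet = df[(df['age'] <= 2) & (df['owner'] == "First Owner") & (df['city'] == "jaipur")]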
bullet.head()

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
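
One way to make a string comparison robust to capitalization is to normalize the case before comparing. A minimal sketch on made-up rows (the frame `bikes` is hypothetical and much smaller than the real dataset):

```python
import pandas as pd

# Hypothetical rows, for illustration only
bikes = pd.DataFrame({
    "city": ["Jaipur", "Delhi", "Jaipur"],
    "age": [2.0, 4.0, 6.0],
})

exact = bikes[bikes["city"] == "jaipur"]                    # case-sensitive: no match
normalized = bikes[bikes["city"].str.lower() == "jaipur"]   # compare in lower case

print(len(exact), len(normalized))
```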


In [94]:
bullet.nunique()

bike_name     0
price         0
city          0
kms_driven    0
owner         0
age           0
power         0
brand         0
dtype: int64

In [95]:
bullet.value_counts()

Series([], Name: count, dtype: int64)

In [96]:
df['brand'].unique()

array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [97]:
bike =df[(df['brand']=="KTM") | (df['brand']=="Jawa")]
bike

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
33,KTM RC 390cc,180000.0,Pune,17700.0,First Owner,4.0,390.0,KTM
35,KTM Duke 200cc,70000.0,Nashik,100000.0,Second Owner,8.0,200.0,KTM
39,KTM RC 200cc ABS,179000.0,Bangalore,3400.0,First Owner,2.0,200.0,KTM
65,KTM Duke 200cc,94700.0,Baripara,32700.0,First Owner,4.0,200.0,KTM
83,KTM Duke 250cc,130000.0,Gandhidham,17500.0,Second Owner,4.0,250.0,KTM
...,...,...,...,...,...,...,...,...
32541,KTM RC 390cc,196700.0,Mumbai,13216.0,First Owner,4.0,390.0,KTM
32560,KTM RC 390cc,196700.0,Mumbai,13216.0,First Owner,4.0,390.0,KTM
32579,KTM RC 390cc,196700.0,Mumbai,13216.0,First Owner,4.0,390.0,KTM
32598,KTM RC 390cc,196700.0,Mumbai,13216.0,First Owner,4.0,390.0,KTM


In [98]:
bike["brand"].value_counts()

brand
KTM     1077
Jawa      10
Name: count, dtype: int64

In [99]:
if "Jaipur" in bike["city"].unique():
    print("present")
else:
    print('not present')

present


In [100]:
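bike["city"].value_counts  # parentheses are missing, so this displays the bound method; call bike["city"].value_counts() to get the counts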

<bound method IndexOpsMixin.value_counts of 33             Pune
35           Nashik
39        Bangalore
65         Baripara
83       Gandhidham
            ...    
32541        Mumbai
32560        Mumbai
32579        Mumbai
32598        Mumbai
32636        Mumbai
Name: city, Length: 1087, dtype: object>
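
The output above is the method object itself rather than the counts. A small sketch of the difference, using made-up values (the Series `cities` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical values, for illustration only
cities = pd.Series(["Pune", "Mumbai", "Mumbai"], name="city")

method = cities.value_counts    # no parentheses: the method object itself
counts = cities.value_counts()  # parentheses: the method is called

print(callable(method))
print(counts["Mumbai"])
```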

In [101]:
bike[bike['city']=='Jaipur']

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
888,KTM RC 390cc,147000.0,Jaipur,15000.0,First Owner,6.0,390.0,KTM
4945,KTM Duke 390cc,200000.0,Jaipur,11700.0,First Owner,4.0,390.0,KTM
4948,KTM RC 390cc,175000.0,Jaipur,10880.0,First Owner,4.0,390.0,KTM
6073,KTM RC 200cc,190000.0,Jaipur,7902.0,First Owner,2.0,200.0,KTM
6159,KTM Duke 250cc,135000.0,Jaipur,12507.0,First Owner,4.0,250.0,KTM
6552,KTM RC 200cc,190000.0,Jaipur,7902.0,First Owner,2.0,200.0,KTM
6734,KTM RC 200cc,128000.0,Jaipur,15000.0,First Owner,4.0,200.0,KTM
7561,KTM Duke 250cc,150000.0,Jaipur,12500.0,First Owner,4.0,250.0,KTM


In [102]:
# df holds the full dataset (the population)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32648 entries, 0 to 32647
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   bike_name   32648 non-null  object 
 1   price       32648 non-null  float64
 2   city        32648 non-null  object 
 3   kms_driven  32648 non-null  float64
 4   owner       32648 non-null  object 
 5   age         32648 non-null  float64
 6   power       32648 non-null  float64
 7   brand       32648 non-null  object 
dtypes: float64(4), object(4)
memory usage: 2.0+ MB


In [103]:
df.duplicated().sum()

np.int64(25324)

In [104]:
df.drop_duplicates(inplace=True)

In [105]:
df.shape

(7324, 8)
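
A minimal sketch of how `duplicated` and `drop_duplicates` behave, on a made-up three-row table (the frame `listings` is hypothetical). Note that `drop_duplicates` keeps the original row labels, which is why the index of a deduplicated DataFrame can have gaps unless it is reset:

```python
import pandas as pd

# Hypothetical listings, for illustration only
listings = pd.DataFrame({
    "bike_name": ["KTM RC 390cc", "KTM RC 390cc", "Hero Passion Pro 100cc"],
    "price": [196700.0, 196700.0, 39000.0],
})

# duplicated marks rows that repeat an earlier row
print(listings.duplicated().tolist())

kept = listings.drop_duplicates()        # index labels 0 and 2 survive
deduped = kept.reset_index(drop=True)    # renumber the rows as 0, 1
print(list(kept.index), list(deduped.index))
```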

In [106]:
unique_record=df.drop_duplicates()
unique_record


Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [107]:
bullet =df[df["brand"] == "Royal Enfield"]


In [108]:
bullet.shape

(1346, 8)

In [109]:
bullet.sort_values(by="price").head(10)



Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
5811,Royal Enfield Thunderbird 350cc,33500.0,Delhi,49463.0,First Owner,16.0,350.0,Royal Enfield
7038,Royal Enfield Bullet Electra 350cc,35000.0,Delhi,60000.0,Fourth Owner Or More,18.0,350.0,Royal Enfield
9183,Royal Enfield Thunderbird 350cc,35800.0,Bangalore,90408.0,Third Owner,18.0,350.0,Royal Enfield
4664,Royal Enfield Bullet Electra 350cc,41000.0,Noida,120000.0,First Owner,17.0,350.0,Royal Enfield
8475,Royal Enfield Thunderbird 350cc,45000.0,Delhi,45710.0,First Owner,16.0,350.0,Royal Enfield
5918,Royal Enfield Thunderbird 350cc,45000.0,Bangalore,93108.0,Third Owner,18.0,350.0,Royal Enfield
2371,Royal Enfield Bullet 350 cc,45000.0,Gurgaon,40000.0,Second Owner,20.0,350.0,Royal Enfield
6489,Royal Enfield Thunderbird 350cc,45918.0,Bangalore,51396.0,Second Owner,12.0,350.0,Royal Enfield
385,Royal Enfield Thunderbird 350cc,46000.0,Chennai,35000.0,First Owner,16.0,350.0,Royal Enfield
9182,Royal Enfield Thunderbird 350cc,47000.0,Pune,60045.0,First Owner,12.0,350.0,Royal Enfield


In [110]:
bullet.sort_values(by="price",ascending=False).head(10)



Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
4912,Royal Enfield Continental GT 650cc,285000.0,Hyderabad,4500.0,First Owner,2.0,650.0,Royal Enfield
277,Royal Enfield Interceptor 650cc,280000.0,Bangalore,1500.0,First Owner,2.0,650.0,Royal Enfield
2228,Royal Enfield Interceptor 650cc,280000.0,Mumbai,5000.0,First Owner,2.0,650.0,Royal Enfield
1931,Royal Enfield Interceptor 650cc,270000.0,Ahmedabad,6500.0,First Owner,2.0,650.0,Royal Enfield
5912,Royal Enfield Interceptor 650cc,265500.0,Nellore,12000.0,First Owner,2.0,650.0,Royal Enfield
2332,Royal Enfield Interceptor 650cc,265000.0,Delhi,8500.0,First Owner,2.0,650.0,Royal Enfield
428,Royal Enfield Interceptor 650cc,265000.0,Bangalore,12900.0,First Owner,2.0,650.0,Royal Enfield
1396,Royal Enfield Interceptor 650cc,265000.0,Delhi,11000.0,First Owner,2.0,650.0,Royal Enfield
81,Royal Enfield Interceptor 650cc,260000.0,Navi Mumbai,3800.0,First Owner,2.0,650.0,Royal Enfield
1976,Royal Enfield Standard 350cc,250000.0,Chennai,1400.0,Second Owner,27.0,350.0,Royal Enfield
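
Sorting can also use more than one column, and `nsmallest` / `nlargest` are shortcuts for "sort, then take the top rows". A minimal sketch on made-up data (the frame `bikes` and its bike names are placeholders):

```python
import pandas as pd

# Hypothetical data, for illustration only
bikes = pd.DataFrame({
    "bike_name": ["A", "B", "C", "D"],
    "price": [45000.0, 33500.0, 45000.0, 285000.0],
    "kms_driven": [93108.0, 49463.0, 45710.0, 4500.0],
})

# Sort by price; rows with the same price are ordered by kms_driven
ordered = bikes.sort_values(by=["price", "kms_driven"])

cheapest = bikes.nsmallest(2, "price")  # the two lowest prices
priciest = bikes.nlargest(1, "price")   # the single highest price

print(ordered)
print(cheapest)
print(priciest)
```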


In [111]:
# Filter the Yamaha bikes in Delhi, then show the ten with the fewest kilometres driven
yamaha = df[(df["brand"] == "Yamaha") & (df['city'] == "Delhi")]
# yamaha = yamaha[['city'] == 'Jaipur'] raises an error; the condition must be yamaha['city'] == 'Jaipur'

yamaha.sort_values(by="kms_driven").head(10)


Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
628,Yamaha FZs 150cc,77500.0,Delhi,45.0,First Owner,3.0,150.0,Yamaha
8693,Yamaha FZ25 ABS 250cc,142000.0,Delhi,233.0,First Owner,2.0,250.0,Yamaha
4200,Yamaha YZF-R1 1000cc,700000.0,Delhi,285.0,First Owner,3.0,1000.0,Yamaha
8196,Yamaha FZ25 250cc,110000.0,Delhi,800.0,First Owner,3.0,250.0,Yamaha
276,Yamaha Saluto 125cc Disc Special Edition,62000.0,Delhi,1123.0,First Owner,2.0,125.0,Yamaha
2309,Yamaha MT-15 150cc,125000.0,Delhi,1500.0,First Owner,2.0,150.0,Yamaha
8254,Yamaha YZF-R15 150cc,102000.0,Delhi,2000.0,First Owner,5.0,150.0,Yamaha
9284,Yamaha FZ25 250cc,115800.0,Delhi,2126.0,First Owner,3.0,250.0,Yamaha
5648,Yamaha YZF-R15 2.0 150cc,66000.0,Delhi,2465.0,First Owner,9.0,150.0,Yamaha
148,Yamaha YZF-R15 V3 150cc,140000.0,Delhi,2473.0,First Owner,2.0,150.0,Yamaha


In [112]:
yamaha.to_csv('yamaha.csv',index=False)# export csv file

In [113]:
yamaha.drop(['brand'],axis='columns')

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power
5,Yamaha FZs 150cc,53499.0,Delhi,25000.0,First Owner,6.0,150.0
24,Yamaha FZ V 2.0 150cc,45000.0,Delhi,23000.0,First Owner,6.0,150.0
112,Yamaha YZF-R15 S 150cc,63000.0,Delhi,16000.0,First Owner,5.0,150.0
147,Yamaha YZF-R15 2.0 150cc,68500.0,Delhi,68500.0,Second Owner,7.0,150.0
148,Yamaha YZF-R15 V3 150cc,140000.0,Delhi,2473.0,First Owner,2.0,150.0
...,...,...,...,...,...,...,...
9152,Yamaha FZs 150cc,38000.0,Delhi,40826.0,First Owner,8.0,150.0
9205,Yamaha FZs 150cc,31000.0,Delhi,58000.0,First Owner,10.0,150.0
9284,Yamaha FZ25 250cc,115800.0,Delhi,2126.0,First Owner,3.0,250.0
9318,Yamaha Fazer 150cc,37000.0,Delhi,77596.0,First Owner,8.0,150.0


In [114]:
# Columns of interest: bike_name, price, age, kms_driven
# yamaha.drop()
# How to add or delete columns in a DataFrame
# yamaha.drop('owner', axis="columns")  # for a single column

In [115]:
# for multiple columns
yamaha.drop(['city','power'],axis='columns')

Unnamed: 0,bike_name,price,kms_driven,owner,age,brand
5,Yamaha FZs 150cc,53499.0,25000.0,First Owner,6.0,Yamaha
24,Yamaha FZ V 2.0 150cc,45000.0,23000.0,First Owner,6.0,Yamaha
112,Yamaha YZF-R15 S 150cc,63000.0,16000.0,First Owner,5.0,Yamaha
147,Yamaha YZF-R15 2.0 150cc,68500.0,68500.0,Second Owner,7.0,Yamaha
148,Yamaha YZF-R15 V3 150cc,140000.0,2473.0,First Owner,2.0,Yamaha
...,...,...,...,...,...,...
9152,Yamaha FZs 150cc,38000.0,40826.0,First Owner,8.0,Yamaha
9205,Yamaha FZs 150cc,31000.0,58000.0,First Owner,10.0,Yamaha
9284,Yamaha FZ25 250cc,115800.0,2126.0,First Owner,3.0,Yamaha
9318,Yamaha Fazer 150cc,37000.0,77596.0,First Owner,8.0,Yamaha


In [116]:
yamaha.drop('owner',axis=1)

Unnamed: 0,bike_name,price,city,kms_driven,age,power,brand
5,Yamaha FZs 150cc,53499.0,Delhi,25000.0,6.0,150.0,Yamaha
24,Yamaha FZ V 2.0 150cc,45000.0,Delhi,23000.0,6.0,150.0,Yamaha
112,Yamaha YZF-R15 S 150cc,63000.0,Delhi,16000.0,5.0,150.0,Yamaha
147,Yamaha YZF-R15 2.0 150cc,68500.0,Delhi,68500.0,7.0,150.0,Yamaha
148,Yamaha YZF-R15 V3 150cc,140000.0,Delhi,2473.0,2.0,150.0,Yamaha
...,...,...,...,...,...,...,...
9152,Yamaha FZs 150cc,38000.0,Delhi,40826.0,8.0,150.0,Yamaha
9205,Yamaha FZs 150cc,31000.0,Delhi,58000.0,10.0,150.0,Yamaha
9284,Yamaha FZ25 250cc,115800.0,Delhi,2126.0,3.0,250.0,Yamaha
9318,Yamaha Fazer 150cc,37000.0,Delhi,77596.0,8.0,150.0,Yamaha
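
`drop` returns a new DataFrame and leaves the original untouched, which is why `yamaha` still has all of its columns after each call above. A minimal sketch on a made-up one-row table (the frame `bikes` and the derived column `price_lakh` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical row, for illustration only
bikes = pd.DataFrame({
    "bike_name": ["Yamaha FZs 150cc"],
    "price": [53499.0],
    "owner": ["First Owner"],
})

without_owner = bikes.drop(columns=["owner"])  # same as drop("owner", axis=1)
print(list(bikes.columns))          # the original still has "owner"
print(list(without_owner.columns))

# Adding a column is a plain assignment
bikes["price_lakh"] = bikes["price"] / 100000
print(list(bikes.columns))
```

To keep the result of a `drop`, assign it back to a variable.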


In [117]:
# jaipur_bullet=  df[(df['brand'] == "Royal Enfield") & (df['age'] <= 2) & (df['city'] == "")]
# jaipur_bullet.head()

In [118]:
# _____________assignment_____________________________
# Thoroughly read and practice all the operations
# Read any Excel file and modify it with Pandas
# 

In [119]:
import pandas as pd
import numpy as np

In [120]:
# df=pd.read.

In [121]:
data = {'A':[2,5,np.nan,8,np.nan,9],
       'B':[np.nan,45,np.nan,89,63,np.nan], 
       'C':[np.nan,74,np.nan,np.nan,85,4], 
       'D':[10,20,30,40,50,60]}
data


{'A': [2, 5, nan, 8, nan, 9],
 'B': [nan, 45, nan, 89, 63, nan],
 'C': [nan, 74, nan, nan, 85, 4],
 'D': [10, 20, 30, 40, 50, 60]}

In [122]:
df2=pd.DataFrame(data)
df2

Unnamed: 0,A,B,C,D
0,2.0,,,10
1,5.0,45.0,74.0,20
2,,,,30
3,8.0,89.0,,40
4,,63.0,85.0,50
5,9.0,,4.0,60


In [123]:
# NaN marks a missing value; many ML algorithms do not accept missing values
# Option 1: remove the records
# Option 2: fill the records

In [124]:
df2.dropna()  # by default, removes every row that contains a missing value

Unnamed: 0,A,B,C,D
1,5.0,45.0,74.0,20


In [125]:
df2.dropna(axis='columns')

Unnamed: 0,D
0,10
1,20
2,30
3,40
4,50
5,60
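
Dropping every row or column with any missing value is often too aggressive. `dropna` also accepts `subset`, `thresh`, and `how` to narrow the rule. A minimal sketch that rebuilds the same data as `df2` under the illustrative name `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": [2, 5, np.nan, 8, np.nan, 9],
    "B": [np.nan, 45, np.nan, 89, 63, np.nan],
    "C": [np.nan, 74, np.nan, np.nan, 85, 4],
    "D": [10, 20, 30, 40, 50, 60],
})

only_a = frame.dropna(subset=["A"])  # drop rows where A is missing
at_least_3 = frame.dropna(thresh=3)  # keep rows with at least 3 non-missing values
all_missing = frame.dropna(how="all")  # drop rows only if every value is missing

print(list(only_a.index))
print(list(at_least_3.index))
print(len(all_missing))
```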


In [126]:
df2

Unnamed: 0,A,B,C,D
0,2.0,,,10
1,5.0,45.0,74.0,20
2,,,,30
3,8.0,89.0,,40
4,,63.0,85.0,50
5,9.0,,4.0,60


In [127]:
print("Total missing values in your df:",df2.isnull().sum().sum())

Total missing values in your df: 8


In [128]:
df2.isnull().sum()/df2.shape[0]*100
# percentage of missing values in each column

A    33.333333
B    50.000000
C    50.000000
D     0.000000
dtype: float64
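
The same percentages can be obtained a little more directly: the mean of a boolean mask is the fraction of `True` values. A minimal sketch on two of the columns (the name `frame` is illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": [2, 5, np.nan, 8, np.nan, 9],
    "D": [10, 20, 30, 40, 50, 60],
})

# isna() gives True/False per cell; mean() turns that into a fraction per column
pct_missing = frame.isna().mean() * 100
print(pct_missing.round(2))
```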

In [129]:
# filling the records
df2.fillna(500)  # returns a new DataFrame; assign the result back (or pass inplace=True) to keep the change

Unnamed: 0,A,B,C,D
0,2.0,500.0,500.0,10
1,5.0,45.0,74.0,20
2,500.0,500.0,500.0,30
3,8.0,89.0,500.0,40
4,500.0,63.0,85.0,50
5,9.0,500.0,4.0,60


In [130]:
df2

Unnamed: 0,A,B,C,D
0,2.0,,,10
1,5.0,45.0,74.0,20
2,,,,30
3,8.0,89.0,,40
4,,63.0,85.0,50
5,9.0,,4.0,60
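The frame still shows its gaps because `fillna` returns a new object and leaves the original alone. A small sketch of that behaviour on a made-up frame (`demo` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [2, np.nan, 9], 'D': [10, 20, 30]})

filled = demo.fillna(500)         # new DataFrame; demo is untouched

print(demo['A'].isna().sum())     # 1 -> the original still has its gap
print(filled['A'].isna().sum())   # 0 -> only the returned copy is filled

# To keep the change, assign the result back to a name
demo_filled = demo.fillna(500)
```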


In [131]:
# A =500
# B=600
# C=700
df2['A'].fillna(500)

0      2.0
1      5.0
2    500.0
3      8.0
4    500.0
5      9.0
Name: A, dtype: float64

In [132]:
df2['B'].fillna(600)

0    600.0
1     45.0
2    600.0
3     89.0
4     63.0
5    600.0
Name: B, dtype: float64
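The per-column fill values noted above (A=500, B=600, C=700) can also be applied in one call by passing a dictionary to `fillna`. A minimal sketch on a rebuilt copy of the toy frame (`demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'A': [2, 5, np.nan, 8, np.nan, 9],
    'B': [np.nan, 45, np.nan, 89, 63, np.nan],
    'C': [np.nan, 74, np.nan, np.nan, 85, 4],
    'D': [10, 20, 30, 40, 50, 60],
})

# Keys are column names, values are the replacement for that column
fill_values = {'A': 500, 'B': 600, 'C': 700}
filled = demo.fillna(value=fill_values)

print(filled)
```

Columns that are not listed in the dictionary (here `D`) are left as they are.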

In [133]:
df2['A'].mean()

np.float64(6.0)

In [134]:
df2['A'].fillna(df2['A'].mean())

0    2.0
1    5.0
2    6.0
3    8.0
4    6.0
5    9.0
Name: A, dtype: float64

In [135]:
df2['A'].fillna(df2['A'].median())

0    2.0
1    5.0
2    6.5
3    8.0
4    6.5
5    9.0
Name: A, dtype: float64

In [136]:
# df2['A'].mode()  # the mode is the most frequently occurring value
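Filling with the mode works like the mean and median cases, with one difference: `mode()` returns a Series, because several values can tie. A sketch on a made-up Series, taking the first mode as the fill value:

```python
import numpy as np
import pandas as pd

# Made-up values; 7 occurs most often
s = pd.Series([3, 7, 7, np.nan, 2, 7, np.nan])

modes = s.mode()                   # Series of the most frequent value(s); NaN is ignored
filled = s.fillna(modes.iloc[0])   # pick the first mode as the fill value

print(modes.tolist())
print(filled.tolist())
```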



# Groupby

In Python, particularly when using the pandas library, the `groupby` method is used to split data into groups based on some criteria. It’s a powerful tool for data analysis, allowing you to group data, apply some operations to each group independently, and then combine the results back together. This is often referred to as the “split-apply-combine” strategy.

### Key Concepts of `groupby`

1. **Splitting**: The data is divided into groups based on some criteria. This could be based on the values of one or more columns.
2. **Applying**: A function is applied to each group independently. This could be an aggregation function like `sum`, `mean`, `count`, etc., or any custom function.
3. **Combining**: The results of the function applications are combined into a new data structure.

### Usage

Here is a basic example to illustrate how `groupby` works:

```python
import pandas as pd

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward', 'Fiona'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Age': [24, 27, 22, 32, 29, 34],
    'Salary': [70000, 80000, 65000, 90000, 85000, 95000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Group by 'City' column
grouped = df.groupby('City')

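# Calculate mean age and salary for each city
# (numeric_only=True skips the text column 'Name', which recent pandas versions refuse to average)
mean_values = grouped.mean(numeric_only=True)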

print(mean_values)
```

### Output

```
             Age   Salary
City                      
Los Angeles  31.0  88333.33
New York     25.0  73333.33
```

### Explanation

1. **Import pandas**:
   ```python
   import pandas as pd
   ```

2. **Create a DataFrame**:
   ```python
   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward', 'Fiona'],
       'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
       'Age': [24, 27, 22, 32, 29, 34],
       'Salary': [70000, 80000, 65000, 90000, 85000, 95000]
   }
   df = pd.DataFrame(data)
   ```

3. **Group by 'City'**:
   ```python
   grouped = df.groupby('City')
   ```

4. **Calculate mean values** (numeric columns only):
   ```python
   mean_values = grouped.mean(numeric_only=True)
   print(mean_values)
   ```

### Common Aggregation Functions

- `sum()`: Compute sum of group values.
- `mean()`: Compute mean of group values.
- `std()`: Compute standard deviation of group values.
- `min()`: Compute min of group values.
- `max()`: Compute max of group values.
- `count()`: Count the non-missing values in each group.
- `size()`: Compute the number of rows in each group, missing values included.
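The difference between `count()` and `size()` only shows up when a group contains missing values. A short sketch with made-up salaries (`toy` and `salary_by_city` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'City': ['New York', 'New York', 'Los Angeles', 'Los Angeles', 'Los Angeles'],
    'Salary': [70000, np.nan, 80000, 90000, np.nan],
})

salary_by_city = toy.groupby('City')['Salary']

counts = salary_by_city.count()  # non-missing salaries per city
sizes = salary_by_city.size()    # rows per city, NaN included

print(counts)
print(sizes)
```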

### Custom Functions

You can also apply custom functions using the `apply` method:

```python
# Custom function to calculate range
def data_range(x):
    return x.max() - x.min()

# Apply custom function to each group
range_values = grouped['Salary'].apply(data_range)
print(range_values)
```

### Output

```
City
Los Angeles    15000
New York       20000
Name: Salary, dtype: int64
```

In summary, `groupby` in pandas is a versatile and powerful tool for grouping data and performing operations on these groups, which is essential for data analysis and manipulation.

In [137]:
df=pd.read_csv("Used_Bikes.csv")
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
32643,Hero Passion Pro 100cc,39000.0,Delhi,22000.0,First Owner,4.0,100.0,Hero
32644,TVS Apache RTR 180cc,30000.0,Karnal,6639.0,First Owner,9.0,180.0,TVS
32645,Bajaj Avenger Street 220,60000.0,Delhi,20373.0,First Owner,6.0,220.0,Bajaj
32646,Hero Super Splendor 125cc,15600.0,Jaipur,84186.0,First Owner,16.0,125.0,Hero


In [138]:
df['price'].max()

np.float64(1900000.0)

In [139]:
df[(df['brand']=='TVS')]['price'].min()

np.float64(5800.0)

In [140]:
brand_group=df.groupby('brand')
brand_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B7949F830>

In [141]:
brand_group.get_group('Yamaha')

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
5,Yamaha FZs 150cc,53499.0,Delhi,25000.0,First Owner,6.0,150.0,Yamaha
10,Yamaha YZF-R15 2.0 150cc,72000.0,Bangalore,20000.0,First Owner,7.0,150.0,Yamaha
11,Yamaha FZ25 250cc,95000.0,Bangalore,9665.0,First Owner,4.0,250.0,Yamaha
24,Yamaha FZ V 2.0 150cc,45000.0,Delhi,23000.0,First Owner,6.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
32610,Yamaha FZ 150cc,45000.0,Chennai,18742.0,First Owner,6.0,150.0,Yamaha
32612,Yamaha YZF-R15 2.0 150cc,55000.0,Rupnagar,27000.0,First Owner,9.0,150.0,Yamaha
32622,Yamaha Fazer 150cc,25000.0,Delhi,65000.0,First Owner,12.0,150.0,Yamaha
32631,Yamaha SZ-RR 150cc,20000.0,Kanchipuram,52000.0,First Owner,10.0,150.0,Yamaha


In [142]:
brand_group[['price']].min()

brand,price
BMW,255000.0
Bajaj,6400.0
Benelli,110700.0
Ducati,380000.0
Harley-Davidson,250000.0
Hero,5000.0
Honda,10000.0
Hyosung,120000.0
Ideal,100000.0
Indian,700000.0


In [143]:
brand_group['price'].max()

brand
BMW                1800000.0
Bajaj               195000.0
Benelli             785000.0
Ducati             1500000.0
Harley-Davidson    1100000.0
Hero                104000.0
Honda               800000.0
Hyosung             493500.0
Ideal               100000.0
Indian             1900000.0
Jawa                223000.0
KTM                 860000.0
Kawasaki           1100000.0
LML                   4400.0
MV                 1500000.0
Mahindra            175000.0
Rajdoot              75000.0
Royal Enfield       285000.0
Suzuki             1260000.0
TVS                 224000.0
Triumph            1300000.0
Yamaha             1550000.0
Yezdi                68000.0
Name: price, dtype: float64

In [144]:
brand_group[['price']].max()

brand,price
BMW,1800000.0
Bajaj,195000.0
Benelli,785000.0
Ducati,1500000.0
Harley-Davidson,1100000.0
Hero,104000.0
Honda,800000.0
Hyosung,493500.0
Ideal,100000.0
Indian,1900000.0


In [145]:
brand_group[['price']].median()

brand,price
BMW,340000.0
Bajaj,42000.0
Benelli,232500.0
Ducati,860500.0
Harley-Davidson,450000.0
Hero,18000.0
Honda,43000.0
Hyosung,227500.0
Ideal,100000.0
Indian,700000.0


In [146]:
brand_group['price'].agg(min_price='min',max_price='max',avg_price='mean')

brand,min_price,max_price,avg_price
BMW,255000.0,1800000.0,598750.0
Bajaj,6400.0,195000.0,48331.27
Benelli,110700.0,785000.0,294200.0
Ducati,380000.0,1500000.0,935545.5
Harley-Davidson,250000.0,1100000.0,452998.8
Hero,5000.0,104000.0,23829.45
Honda,10000.0,800000.0,59230.47
Hyosung,120000.0,493500.0,249167.8
Ideal,100000.0,100000.0,100000.0
Indian,700000.0,1900000.0,1100000.0


In [147]:
# Quiz: for each brand, find the minimum and maximum kms_driven

In [148]:
brand_group=df.groupby('kms_driven')
brand_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B7948FFB0>

In [149]:
brand_group['kms_driven'].agg(min_kms='min',max_kms='max',avg_kms='mean')

kms_driven,min_kms,max_kms,avg_kms
1.0,1.0,1.0,1.0
3.0,3.0,3.0,3.0
22.0,22.0,22.0,22.0
23.0,23.0,23.0,23.0
30.0,30.0,30.0,30.0
...,...,...,...
566931.0,566931.0,566931.0,566931.0
646000.0,646000.0,646000.0,646000.0
654984.0,654984.0,654984.0,654984.0
717794.0,717794.0,717794.0,717794.0


In [150]:
brand_group=df.groupby('owner')
brand_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B7948E270>

In [151]:
brand_group['kms_driven'].agg(min_kms='min',max_kms='max',avg_kms='mean')

owner,min_kms,max_kms,avg_kms
First Owner,3.0,750000.0,26474.521793
Fourth Owner Or More,2009.0,60000.0,29173.333333
Second Owner,1.0,300000.0,24465.370125
Third Owner,23.0,101250.0,34606.138889


In [152]:
cat_col=df.select_dtypes(include='O')
cat_col.head()

Unnamed: 0,bike_name,city,owner,brand
0,TVS Star City Plus Dual Tone 110cc,Ahmedabad,First Owner,TVS
1,Royal Enfield Classic 350cc,Delhi,First Owner,Royal Enfield
2,Triumph Daytona 675R,Delhi,First Owner,Triumph
3,TVS Apache RTR 180cc,Bangalore,First Owner,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,Bangalore,First Owner,Yamaha


In [153]:
num_col=df.select_dtypes(exclude='O')
num_col.head()



Unnamed: 0,price,kms_driven,age,power
0,35000.0,17654.0,3.0,110.0
1,119900.0,11000.0,4.0,350.0
2,600000.0,110.0,8.0,675.0
3,65000.0,16329.0,4.0,180.0
4,80000.0,10000.0,3.0,150.0


In [154]:
# combining two DataFrames side by side (column-wise)

In [155]:
pd.concat([cat_col,num_col],axis='columns')

Unnamed: 0,bike_name,city,owner,brand,price,kms_driven,age,power
0,TVS Star City Plus Dual Tone 110cc,Ahmedabad,First Owner,TVS,35000.0,17654.0,3.0,110.0
1,Royal Enfield Classic 350cc,Delhi,First Owner,Royal Enfield,119900.0,11000.0,4.0,350.0
2,Triumph Daytona 675R,Delhi,First Owner,Triumph,600000.0,110.0,8.0,675.0
3,TVS Apache RTR 180cc,Bangalore,First Owner,TVS,65000.0,16329.0,4.0,180.0
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,Bangalore,First Owner,Yamaha,80000.0,10000.0,3.0,150.0
...,...,...,...,...,...,...,...,...
32643,Hero Passion Pro 100cc,Delhi,First Owner,Hero,39000.0,22000.0,4.0,100.0
32644,TVS Apache RTR 180cc,Karnal,First Owner,TVS,30000.0,6639.0,9.0,180.0
32645,Bajaj Avenger Street 220,Delhi,First Owner,Bajaj,60000.0,20373.0,6.0,220.0
32646,Hero Super Splendor 125cc,Jaipur,First Owner,Hero,15600.0,84186.0,16.0,125.0


In [156]:
#_____assignment_______ pd.merge()_________________
# what are joins?
# practice pd.merge()
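As a starting point for the assignment, a join combines rows from two tables that share a key, and the `how` argument of `pd.merge()` decides which keys survive. A rough sketch with two hypothetical tables invented for this example (`bike_id` is not a column of `Used_Bikes.csv`):

```python
import pandas as pd

# Hypothetical tables that share the key column 'bike_id'
prices = pd.DataFrame({'bike_id': [1, 2, 3],
                       'price': [35000, 119900, 65000]})
owners = pd.DataFrame({'bike_id': [2, 3, 4],
                       'owner': ['First Owner', 'Second Owner', 'First Owner']})

inner = pd.merge(prices, owners, on='bike_id', how='inner')  # keys present in both tables
left = pd.merge(prices, owners, on='bike_id', how='left')    # every key from prices
outer = pd.merge(prices, owners, on='bike_id', how='outer')  # keys from either table

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side get `NaN` in the columns that come from that side.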

# WHAT IS POPULATION DATA?
Population data is demographic data about a group of people, such as age, sex, geographic location, income, education level, occupation, and marital status. In pandas it is typically held in a DataFrame, a tabular structure in which each row is one person and each column is one attribute, which makes the data easy to manipulate and analyze.

Using pandas, you can easily create, view, and manipulate population data. For instance, you can create a DataFrame from a dictionary of lists where each key-value pair represents a column of demographic data:

```python
import pandas as pd

data = {
    'Name': ['Amit', 'Sunita', 'Raj', 'Priya', 'Karan'],
    'Age': [25, 30, 22, 28, 35],
    'Sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Place': ['Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Bangalore'],
    'Married': ['No', 'Yes', 'No', 'Yes', 'Yes'],
    'Occupation': ['Engineer', 'Doctor', 'Student', 'Teacher', 'Business'],
    'Income': [70000, 85000, 30000, 65000, 90000],
    'Caste': ['General', 'OBC', 'SC', 'ST', 'OBC']
}

df = pd.DataFrame(data)
print(df.head())
```

Once the DataFrame is created, you can perform various operations such as selecting specific columns or rows, filtering data based on conditions, aggregating data, and more. For example, you can select specific columns or filter data to find all individuals who are married and over 30 years old:

```python
married_over_30 = df[(df['Married'] == 'Yes') & (df['Age'] > 30)]
print(married_over_30)
```

Pandas makes it easy to handle large datasets, perform complex data analysis, and visualize demographic trends, making it a powerful tool for working with population data.
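A brief sketch of the aggregation step mentioned above, using a few columns of the same sample people (the frame name `people` is only for illustration):

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Amit', 'Sunita', 'Raj', 'Priya', 'Karan'],
    'Sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Married': ['No', 'Yes', 'No', 'Yes', 'Yes'],
    'Income': [70000, 85000, 30000, 65000, 90000],
})

# Average income within each sex
mean_income_by_sex = people.groupby('Sex')['Income'].mean()

# How many people fall into each marital status
married_counts = people['Married'].value_counts()

print(mean_income_by_sex)
print(married_counts)
```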

# TYPES OF DATA?
In pandas, data can be categorized into several types based on its structure and the nature of the values it holds. Understanding these data types is crucial for effective data manipulation and analysis. Here are the primary types of data in pandas:

1. **Series**
   - A one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).
   - Each element in a Series is associated with a label, also known as an index.
   - Example:
     ```python
     import pandas as pd
     s = pd.Series([1, 2, 3, 4, 5])
     print(s)
     ```

2. **DataFrame**
   - A two-dimensional labeled data structure with columns of potentially different types.
   - Essentially a table where each column can be of a different data type.
   - Example:
     ```python
     data = {
         'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 35],
         'City': ['New York', 'Los Angeles', 'Chicago']
     }
     df = pd.DataFrame(data)
     print(df)
     ```

3. **Index**
   - An immutable array-like structure that holds the labels for Series and DataFrame rows and columns.
   - Can be created using a variety of methods including from arrays and other sequences.
   - Example:
     ```python
     idx = pd.Index([1, 2, 3])
     print(idx)
     ```
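To illustrate what "immutable" means for an Index, here is a small sketch: assigning to a position raises a `TypeError`, while methods such as `append` return a new Index. The names `labels` and `extended` are illustrative:

```python
import pandas as pd

labels = pd.Index([1, 2, 3])

try:
    labels[0] = 10               # in-place assignment is not allowed
except TypeError as err:
    print("TypeError:", err)

# Operations return a new Index instead of changing the original
extended = labels.append(pd.Index([4]))

print(labels.tolist())    # [1, 2, 3]
print(extended.tolist())  # [1, 2, 3, 4]
```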

### Data Types within Series and DataFrames

1. **Numeric Types**
   - **int64**: Integer values.
   - **float64**: Floating-point values.
   - Example:
     ```python
     df['Age'] = df['Age'].astype('int64')
     df['Salary'] = df['Salary'].astype('float64')
     ```

2. **String/Object Types**
   - **object**: Generally used for string values, but can also hold any Python object.
   - Example:
     ```python
     df['Name'] = df['Name'].astype('object')
     ```

3. **Boolean Type**
   - **bool**: Boolean values (True or False).
   - Example:
     ```python
     df['Is_Active'] = df['Is_Active'].astype('bool')
     ```

4. **Datetime Type**
   - **datetime64[ns]**: Date and time values.
   - Example:
     ```python
     df['Date'] = pd.to_datetime(df['Date'])
     ```

5. **Categorical Type**
   - **category**: Used for categorical variables that can take on a limited, fixed number of possible values.
   - Example:
     ```python
     df['Category'] = df['Category'].astype('category')
     ```

### Example DataFrame with Various Data Types

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000.0, 80000.0, 90000.0],
    'Is_Active': [True, False, True],
    'Join_Date': ['2020-01-01', '2019-06-15', '2018-03-23'],
    'Category': ['A', 'B', 'A']
}

df = pd.DataFrame(data)

# Convert data types
df['Age'] = df['Age'].astype('int64')
df['Salary'] = df['Salary'].astype('float64')
df['Is_Active'] = df['Is_Active'].astype('bool')
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
df['Category'] = df['Category'].astype('category')

print(df.dtypes)
```

This script creates a DataFrame and explicitly converts most of its columns to specific dtypes (`Name` keeps its default dtype), demonstrating how to handle various types of data in pandas.
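Once columns carry proper dtypes, `select_dtypes` can pick them out by kind, as was done earlier with the bikes data. A sketch on a rebuilt copy of the example frame (named `typed` here so it does not clash with `df`):

```python
import pandas as pd

typed = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000.0, 80000.0, 90000.0],
    'Is_Active': [True, False, True],
    'Join_Date': ['2020-01-01', '2019-06-15', '2018-03-23'],
    'Category': ['A', 'B', 'A'],
})
typed['Join_Date'] = pd.to_datetime(typed['Join_Date'])
typed['Category'] = typed['Category'].astype('category')

# Select column names by dtype family
numeric_cols = typed.select_dtypes(include='number').columns.tolist()
datetime_cols = typed.select_dtypes(include='datetime64').columns.tolist()
category_cols = typed.select_dtypes(include='category').columns.tolist()

print(numeric_cols)    # ['Age', 'Salary']
print(datetime_cols)   # ['Join_Date']
print(category_cols)   # ['Category']
```

Boolean columns such as `Is_Active` are not part of `'number'`; they can be selected with `include='bool'`.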