# Data Transformation

Once your data is clean, the next step is to **reshape, reformat, and reorder** it as needed for analysis. Pandas gives you plenty of flexible tools to do this.

---

In [1]:
import pandas as pd

In [5]:
df = pd.read_csv("Actor-data.csv")

In [6]:
df

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Shah Rukh Khan,Pathaan,2023,Action,1050,7.2
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
9,Kartik Aaryan,Bhool Bhulaiyaa 2,2022,Horror Comedy,266,5.9


## Sorting & Ranking

### Sort by Values

```python
df.sort_values("Age")                   # Ascending sort
df.sort_values("Age", ascending=False)  # Descending
df.sort_values(["Age", "Salary"])       # Sort by multiple columns
```
`df.sort_values(["Age", "Salary"])` sorts the DataFrame by the "Age" column first. Where two or more rows share the same "Age", those rows are then ordered by "Salary". `sort_values()` returns a new DataFrame and leaves the original unchanged unless you assign the result back.
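To mix sort directions across columns, `ascending` also accepts a list with one flag per column. The sketch below uses a small made-up table (the names and values are hypothetical, not taken from `Actor-data.csv`) to show the tie-breaking behavior described above.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate tie-breaking
people = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "Age": [30, 25, 30, 25],
    "Salary": [50000, 42000, 65000, 48000],
})

# Age ascending; within each Age, Salary descending
ordered = people.sort_values(["Age", "Salary"], ascending=[True, False])
print(ordered)

# `people` itself keeps its original row order,
# because sort_values returns a new DataFrame
print(people)
```

The list passed to `ascending` must have the same length as the list of sort columns.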


In [7]:
df.sort_values("Year")

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6


In [9]:
df.sort_values("Year", ascending=False)

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Shah Rukh Khan,Pathaan,2023,Action,1050,7.2
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6
9,Kartik Aaryan,Bhool Bhulaiyaa 2,2022,Horror Comedy,266,5.9
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0


In [10]:
df.sort_values(["Year", "IMDb"])

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6


In [23]:
df2 = df.sort_values(["Year", "IMDb"]).copy()

In [24]:
df2

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6


### Reset Index
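After sorting, the rows keep their original index labels, so the index is no longer sequential. To renumber it from 0, use `reset_index()`. By default the old index is kept as a new column named `index`; pass `drop=True` to discard it.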
```python
df.reset_index(drop=True, inplace=True)  # Reset the index and drop the old index
```
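As a rough illustration of the difference `drop=True` makes, the following sketch uses a hypothetical three-row frame called `scores` (not part of this notebook's data) and compares both forms of the call.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({"Score": [70, 90, 80]})

# Sorting by value leaves the index labels in the order 0, 2, 1
by_score = scores.sort_values("Score")

# Default: the old index is preserved as a column named "index"
kept = by_score.reset_index()
print(kept)

# drop=True: the old index is discarded entirely
dropped = by_score.reset_index(drop=True)
print(dropped)
```

Both calls return new DataFrames, so `by_score` is left untouched here; `inplace=True` is only needed if you want to modify the existing object.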

In [54]:
df2.reset_index()

Unnamed: 0,index,Film,Superstar,Year,Genre,IMDb,BoxOffice(INR Crore)
0,2,Dangal,Aamir Khan,2016,Biography,8.4,2024
1,1,Tiger Zinda Hai,Salman Khan,2017,Action,6.0,565
2,10,Badrinath Ki Dulhania,Varun Dhawan,2017,Romantic Comedy,6.1,201
3,4,Padmaavat,Ranveer Singh,2018,Historical,7.0,585
4,6,Stree,Rajkummar Rao,2018,Horror Comedy,7.5,180
5,5,Andhadhun,Ayushmann Khurrana,2018,Thriller,8.3,111
6,7,War,Hrithik Roshan,2019,Action,6.5,475
7,8,Good Newwz,Akshay Kumar,2019,Comedy,7.0,318
8,11,Uri: The Surgical Strike,Vicky Kaushal,2019,Action,8.2,342
9,3,Brahmastra,Ranbir Kapoor,2022,Fantasy,5.6,431


In [55]:
df2.reset_index(drop=True, inplace=True)

In [56]:
df2

Unnamed: 0,Film,Superstar,Year,Genre,IMDb,BoxOffice(INR Crore)
0,Dangal,Aamir Khan,2016,Biography,8.4,2024
1,Tiger Zinda Hai,Salman Khan,2017,Action,6.0,565
2,Badrinath Ki Dulhania,Varun Dhawan,2017,Romantic Comedy,6.1,201
3,Padmaavat,Ranveer Singh,2018,Historical,7.0,585
4,Stree,Rajkummar Rao,2018,Horror Comedy,7.5,180
5,Andhadhun,Ayushmann Khurrana,2018,Thriller,8.3,111
6,War,Hrithik Roshan,2019,Action,6.5,475
7,Good Newwz,Akshay Kumar,2019,Comedy,7.0,318
8,Uri: The Surgical Strike,Vicky Kaushal,2019,Action,8.2,342
9,Brahmastra,Ranbir Kapoor,2022,Fantasy,5.6,431


### Sort by Index

```python
df.sort_index()
```
`df.sort_index()` sorts the DataFrame by its index labels. If the index is out of order (for example, after sorting by values), `sort_index()` puts the rows back in index order. Unlike `reset_index()`, it keeps the existing labels and only reorders the rows.
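A minimal sketch of this, again on a hypothetical `scores` frame rather than the notebook's data: sort by value to scramble the index, then restore index order with `sort_index()`.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({"Score": [70, 90, 80]})

# Sorting by value descending leaves the index as 1, 2, 0
shuffled = scores.sort_values("Score", ascending=False)

# sort_index() reorders the rows by label; the labels themselves are unchanged
restored = shuffled.sort_index()
print(restored)

# Like sort_values, it accepts ascending=False
reverse = shuffled.sort_index(ascending=False)
print(reverse)
```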

In [29]:
df2.sort_index()

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Shah Rukh Khan,Pathaan,2023,Action,1050,7.2
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
9,Kartik Aaryan,Bhool Bhulaiyaa 2,2022,Horror Comedy,266,5.9


### Ranking
The `.rank()` method assigns ranks to the values in a column, such as scores or points. By default, ranking is ascending, so the smallest value gets rank 1; pass `ascending=False` to give rank 1 to the largest value.

Tied values receive the average of the ranks they span, which can produce decimal numbers. For example, if two people share the top score, they both get a rank of 1.5.

You can change how ties are handled with the `method` parameter. One useful option is `method="dense"`, which assigns the same rank to ties but doesn't leave gaps in the ranking sequence. This is helpful when you want a clean, consecutive ranking without skips.
```python
df["Rank"] = df["Score"].rank()                 # Default: average method
df["Rank"] = df["Score"].rank(method="dense")   # 1, 2, 2, 3
```
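To make the tie-handling concrete, here is a small sketch on a hypothetical `scores` Series with one tie. It compares the default `average` method with `min` and `dense`, all ranked from highest to lowest.

```python
import pandas as pd

# Hypothetical scores; the two 85s are tied for second place
scores = pd.Series([90, 85, 85, 70], name="Score")

ranks = pd.DataFrame({
    "Score": scores,
    # Ties share the mean of positions 2 and 3
    "average": scores.rank(ascending=False),
    # Ties share the lowest position; the next rank is skipped
    "min": scores.rank(ascending=False, method="min"),
    # Ties share a rank and no rank is skipped afterwards
    "dense": scores.rank(ascending=False, method="dense"),
})
print(ranks)
#    Score  average  min  dense
# 0     90      1.0  1.0    1.0
# 1     85      2.5  2.0    2.0
# 2     85      2.5  2.0    2.0
# 3     70      4.0  4.0    3.0
```

Ranks come back as floats even when they are whole numbers; if every rank is integral (as with `dense`), you can convert the column with `.astype(int)`.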

---

In [39]:
df2["Rank"] = df2["IMDb"].rank(ascending=False)  # Default: average method

In [40]:
df2

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb,Rank
2,Aamir Khan,Dangal,2016,Biography,2024,8.4,1.0
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0,10.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1,9.0
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0,6.5
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5,4.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3,2.0
7,Hrithik Roshan,War,2019,Action,475,6.5,8.0
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0,6.5
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2,3.0
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6,12.0


In [41]:
df2["Rank"] = df2["IMDb"].rank(ascending=False, method="dense") # 1, 2, 2, 3

In [42]:
df2

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb,Rank
2,Aamir Khan,Dangal,2016,Biography,2024,8.4,1.0
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0,9.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1,8.0
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0,6.0
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5,4.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3,2.0
7,Hrithik Roshan,War,2019,Action,475,6.5,7.0
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0,6.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2,3.0
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6,11.0


## Renaming Columns & Index

```python
df.rename(columns={"oldName": "newName"}, inplace=True)
df.rename(index={0: "row1", 1: "row2"}, inplace=True)
```

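To rename all columns at once, assign a list of new names. The list must have exactly one name per existing column, in the current column order: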

```python
df.columns = ["Name", "Age", "City"]
```
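One detail worth knowing: by default `rename()` silently ignores keys that do not match any column, so a typo in the old name goes unnoticed. The sketch below, on a hypothetical `df_demo` frame, shows that behavior and the `errors="raise"` option that turns the mismatch into a `KeyError`.

```python
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({"oldName": [1, 2], "other": [3, 4]})

# Without inplace=True, rename returns a new DataFrame.
# The "typo" key matches no column and is ignored by default.
renamed = df_demo.rename(columns={"oldName": "newName", "typo": "ignored"})
print(renamed.columns.tolist())

# errors="raise" reports unmatched keys instead of skipping them
try:
    df_demo.rename(columns={"typo": "ignored"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```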

---

In [45]:
df2.rename(columns={"Actor": "Superstar"}, inplace=True)

In [46]:
df2

Unnamed: 0,Superstar,Film,Year,Genre,BoxOffice(INR Crore),IMDb,Rank
2,Aamir Khan,Dangal,2016,Biography,2024,8.4,1.0
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0,9.0
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1,8.0
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0,6.0
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5,4.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3,2.0
7,Hrithik Roshan,War,2019,Action,475,6.5,7.0
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0,6.0
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2,3.0
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6,11.0


In [52]:
df2 = df2[["Film", "Superstar", "Year", "Genre", "IMDb", "BoxOffice(INR Crore)"]]

In [53]:
df2

Unnamed: 0,Film,Superstar,Year,Genre,IMDb,BoxOffice(INR Crore)
2,Dangal,Aamir Khan,2016,Biography,8.4,2024
1,Tiger Zinda Hai,Salman Khan,2017,Action,6.0,565
10,Badrinath Ki Dulhania,Varun Dhawan,2017,Romantic Comedy,6.1,201
4,Padmaavat,Ranveer Singh,2018,Historical,7.0,585
6,Stree,Rajkummar Rao,2018,Horror Comedy,7.5,180
5,Andhadhun,Ayushmann Khurrana,2018,Thriller,8.3,111
7,War,Hrithik Roshan,2019,Action,6.5,475
8,Good Newwz,Akshay Kumar,2019,Comedy,7.0,318
11,Uri: The Surgical Strike,Vicky Kaushal,2019,Action,8.2,342
3,Brahmastra,Ranbir Kapoor,2022,Fantasy,5.6,431


## Changing Column Order

Index the DataFrame with a list of column names in the order you want. Any column left out of the list is dropped from the result:

```python
df = df[["City", "Name", "Age"]]   # Reorder as desired
```

You can also move one column to the front:

```python
cols = ["Name"] + [col for col in df.columns if col != "Name"]
df = df[cols]
```
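As an alternative sketch for moving a single column, `pop()` and `insert()` can be combined to relocate it in place. The frame `df_demo` and its values are hypothetical; the list-based approach above is shown alongside for comparison.

```python
import pandas as pd

# Hypothetical toy frame with "Name" in the last position
df_demo = pd.DataFrame({"City": ["Pune"], "Age": [31], "Name": ["Asha"]})

# List-based reordering: builds a new DataFrame
cols = ["Name"] + [col for col in df_demo.columns if col != "Name"]
front = df_demo[cols]
print(front.columns.tolist())

# In-place alternative: pop() removes and returns the column,
# insert() puts it back at position 0
in_place = df_demo.copy()
in_place.insert(0, "Name", in_place.pop("Name"))
print(in_place.columns.tolist())
```

The list-based form is usually easier to read; the `pop()`/`insert()` form avoids rebuilding the column list when only one column moves.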

---

## Summary

- Sort, rank, and rename to prepare your data
- Reordering and reshaping are key for EDA and visualization