In [None]:
#!pip install polars -q

In [2]:
import polars as pl
from warnings import filterwarnings

filterwarnings('ignore')

# I. <strong style="color:#5e17eb"> Creating Data Frame </strong>


## <strong style="color:#5e17eb"> From List </strong>


In [None]:
import polars as pl

df = pl.DataFrame([
    ["Abby", "Santa Barbara", 5],
    ["Joe", "Los Angeles", 4]
], schema=["name", "city", "stars"], orient="row")  # each inner list is one row

df

## <strong style="color:#5e17eb"> From Dictionary </strong>


In [None]:
data = {
    "name": ["Abby", "Joe"],
    "city": ["Santa Barbara", "Los Angeles"],
    "stars": [5, 4]
}

df = pl.DataFrame(data)
df

## <strong style="color:#5e17eb">  From a CSV file</strong>


In [None]:
df = pl.read_csv(r"C:\Users\Rudra\Desktop\yelp\data\business.csv")
df.sample()

## <strong style="color:#5e17eb"> Lazy vs Eager Execution  </strong>


<strong style="color:#5e17eb">  Lazy vs Eager Execution </strong>

- `Lazy`
  - Doesn't load the entire dataset into memory up front.
  - `collect()` executes the queued operations and brings the result into memory.
  - When to use -> the dataset is large (GB/TB scale) or memory is limited.
  - How to use -> the `scan_*` functions (e.g. `pl.scan_csv`).
- `Eager`
  - Executes immediately.
  - When to use -> the dataset is small enough to fit entirely in memory.
  - How to use -> the `read_*` functions (e.g. `pl.read_csv`).


What do we use in this series?
- I have 8 GB of RAM, so the business dataset fits in memory.
- To keep the learning loop fast, we use eager mode for now.
- Later in this series we switch to lazy mode.

> Before choosing a mode, open the file with a `scan_*` function and check how large the data is.

| Mode | When to use |
|------|-------------|
| Lazy | When the whole dataset can't be loaded into memory |
| Eager | When the whole dataset fits in memory |

In [None]:
df.estimated_size(unit='mb')

## <strong style="color:#5e17eb"> From a JSON File  </strong>


In [None]:
import os
filepath = r"C:\Users\Rudra\Desktop\yelp\big-data\yelp_academic_dataset_business.json"
if os.path.exists(filepath):
    print("File exists!")
else:
    print("File not found at specified path.")


In [None]:
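# The Yelp file stores one JSON object per line (NDJSON), so read_ndjson is used instead of read_json.
#df = pl.read_json(r"C:\Users\Rudra\Desktop\yelp\big-data\yelp_academic_dataset_business.json")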
df = pl.read_ndjson(r"C:\Users\Rudra\Desktop\yelp\big-data\yelp_academic_dataset_business.json")
df.sample()

## <strong style="color:#5e17eb"> From Parquet </strong>


<strong style="color:#5e17eb"> Parquet </strong>
- Polars + Parquet = more speed: Parquet is a columnar format, so Polars can read just the columns it needs.
- So we converted all the files: JSON -> CSV -> Parquet.

In [None]:
# df.write_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\business.parquet")

In [3]:
df = pl.read_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\business.parquet")

In [4]:
df.sample()

business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
str,str,str,str,str,str,f64,f64,f64,i64,i64,str,str,str
"""eP-qr9WjtfDpTlisqV718w""","""The Sandbox""","""2701 N Swan Rd""","""Tucson""","""AZ""","""85712""",32.256407,-110.893058,3.5,20,1,"""{'BusinessAcceptsCreditCards':…","""Preschools, Elementary Schools…","""{'Monday': '6:30-18:0', 'Tuesd…"


In [None]:
# df.sample(fraction=0.0001)

In [None]:
df.sample(n=1, with_replacement=True)

## <strong style="color:#5e17eb">  Copy the data frame</strong>


In [None]:
dff = df.clone()

# <strong style="color:#5e17eb"> II. DataFrame Attributes & Methods  </strong>


| Pandas              | Polars                |
| ------------------- | --------------------- |
| `df.shape`          | `df.shape`            |
| `df.columns`        | `df.columns`          |
| `df.head()`         | `df.head()`           |
| `df.tail()`         | `df.tail()`           |
| `df.info()`         | `df.glimpse()`        |
| `df.describe()`     | `df.describe()`       |
| `df.dtypes`         | `df.dtypes`           |
| `df.memory_usage()` | `df.estimated_size()` |


## <strong style="color:#5e17eb"> shape </strong>


In [None]:
df.shape

## <strong style="color:#5e17eb"> columns </strong>


In [1]:
df.columns


## <strong style="color:#5e17eb"> dtypes </strong>


In [2]:
df.dtypes


- `dtypes` lists only the data types, without the column names.
- `schema` pairs each column name with its data type, so it gives a clearer picture.
- With `dtypes` alone, how would we know which column is which? 😆


## <strong style="color:#5e17eb"> schema </strong>


<strong style="color:#5e17eb"> Schema </strong>
- Returns a dict-like mapping of column names to their data types.
- Use it to check the actual structure of your DataFrame at a given moment (especially in eager mode).

In [3]:
df.schema


## <strong style="color:#5e17eb"> collect_schema() </strong>


`df.collect_schema()`
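- As discussed earlier, `collect()` is what runs a query in lazy mode.
- `collect_schema()` resolves the schema of a lazy query without running the whole query; on an eager DataFrame it returns the same information as `df.schema`.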

In [4]:
df.collect_schema()


## <strong style="color:#5e17eb"> match_to_schema </strong>


- In operations like `concat`, or when combining multiple files, the schemas must match; for a `join`, the key columns need compatible types.

> ❗ `match_to_schema()` works at the type and column-name level, not just names.

In [None]:
df1 = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pl.DataFrame({"a": [3, 4], "b": ["z", "w"]})

print(df1.schema == df2.schema ) 
print(df1.schema.items() == df2.schema.items()) 

## <strong style="color:#5e17eb">   head() & tail()</strong>


In [None]:
df.head()

In [None]:
df.tail()

## <strong style="color:#5e17eb"> describe() </strong>


In [None]:
df.describe()

## <strong style="color:#5e17eb">  estimated_size() </strong>


In [None]:
df.estimated_size()

## <strong style="color:#5e17eb"> null_count </strong>


In [None]:
df.null_count()

## <strong style="color:#5e17eb"> drop() & drop_nulls() & drop_nans() </strong>


| Method            | Drops      | Acts on   | In-place? | Notes                                             |
| ----------------- | ---------- | --------- | --------- | ------------------------------------------------- |
| `drop()`          | Columns    | Structure | ❌         | Returns a new df                                  |
| `drop_in_place()` | One column | Structure | ✅         | Modifies df and returns the column as a Series    |
| `drop_nulls()`    | Rows       | Content   | ❌         | Removes rows containing any `null`                |
| `drop_nans()`     | Rows       | Content   | ❌         | Removes rows containing any `NaN` (float columns) |


## <strong style="color:#5e17eb">corr()  </strong>


In [None]:
# This takes time
# df.corr()

## <strong style="color:#5e17eb">  count() </strong>


In [6]:
df.count()

business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
150346,150346,145219,150346,150346,150273,150346,150346,150346,150346,150346,136602,150243,127123


## <strong style="color:#5e17eb">  select() & select_seq() </strong>


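- `.select()` → Pick columns or transformations (like SQL SELECT). Expressions are evaluated in parallel.

- `.select_seq()` → Same result, but the expressions run sequentially instead of in parallel, which can help when each expression is very cheap. Expressions in one call still cannot reference columns created in that same call; chain a second `.select()` or `.with_columns()` for dependent steps.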

In [7]:
df.select('hours')

hours
str
""
"""{'Monday': '0:0-0:0', 'Tuesday…"
"""{'Monday': '8:0-22:0', 'Tuesda…"
"""{'Monday': '7:0-20:0', 'Tuesda…"
"""{'Wednesday': '14:0-22:0', 'Th…"
…
"""{'Monday': '10:0-19:30', 'Tues…"
"""{'Monday': '9:30-17:30', 'Tues…"
""
"""{'Monday': '9:0-20:0', 'Tuesda…"


In [11]:
df.select_seq([
    pl.col("stars").alias('x'),
    (pl.col("stars") + 1).alias("stars_plus_one")
]).select(['x', 'stars_plus_one'])

x,stars_plus_one
f64,f64
5.0,6.0
3.0,4.0
3.5,4.5
4.0,5.0
4.5,5.5
…,…
3.0,4.0
4.0,5.0
3.5,4.5
4.0,5.0


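> 🔥 Use `.select_seq()` when each expression is cheap and running them in parallel isn't worth the overhead. For step-wise dependent transformations, chain separate calls.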

## <strong style="color:#5e17eb">  df.serialize() </strong>

- Converts the entire DataFrame into raw bytes (for storage or transmission).
- Used when saving Polars DataFrames in custom binary formats.
- Not human-readable (use `.write_csv()` or `.write_json()` for that, and `.write_parquet()` for compact storage).

In [None]:
#df.serialize()

## <strong style="color:#5e17eb">  df.slice(offset, length) </strong>
- offset -> index of the first row to include (negative values count from the end)
- length -> how many rows to return

In [22]:
df.slice(offset=13, length=3)

business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
str,str,str,str,str,str,f64,f64,f64,i64,i64,str,str,str
"""jaxMSoInw8Poo3XeMJt8lQ""","""Adams Dental""","""15 N Missouri Ave""","""Clearwater""","""FL""","""33755""",27.966235,-82.787412,5.0,10,1,"""{'ByAppointmentOnly': 'True'}""","""General Dentistry, Dentists, H…","""{'Monday': '7:30-15:30', 'Tues…"
"""0bPLkL0QhhPO5kt1_EXmNQ""","""Zio's Italian Market""","""2575 E Bay Dr""","""Largo""","""FL""","""33771""",27.916116,-82.760461,4.5,100,0,"""{'OutdoorSeating': 'False', 'R…","""Food, Delis, Italian, Bakeries…","""{'Monday': '10:0-18:0', 'Tuesd…"
"""MUTTqe8uqyMdBl186RmNeA""","""Tuna Bar""","""205 Race St""","""Philadelphia""","""PA""","""19106""",39.953949,-75.143226,4.0,245,1,"""{'RestaurantsReservations': 'T…","""Sushi Bars, Restaurants, Japan…","""{'Tuesday': '13:30-22:0', 'Wed…"


## <strong style="color:#5e17eb"> to_series()  </strong>


In [23]:
#df.to_series()

## <strong style="color:#5e17eb">  sort() & set_sorted() </strong>


In [28]:
df.sort("stars").select(['stars'])

stars
f64
1.0
1.0
1.0
1.0
1.0
…
5.0
5.0
5.0
5.0


In [25]:
df.set_sorted("stars").select(['stars'])

stars
f64
5.0
3.0
3.5
4.0
4.5
…
3.0
4.0
3.5
4.0


| Method                 | Changes Row Order? | Affects Display? | Use Case                  |
| ---------------------- | ------------------ | ---------------- | ------------------------- |
| `df.sort("col")`       | ✅ Yes              | ✅ Yes            | To reorder data           |
| `df.set_sorted("col")` | ❌ No               | ❌ No             | To **optimize** later ops |

---

> set_sorted("col") is a promise, not an action.
> If your data isn’t actually sorted, use .sort("col").

## <strong style="color:#5e17eb"> df.iter_slices(n_rows)  </strong>

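- Iterates through the DataFrame in row batches (n_rows at a time).

- Handy when a downstream step (exporting, converting, sending over a network) should only handle a bounded number of rows at once. The DataFrame itself is already in memory, so this limits per-batch work rather than load-time memory.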

In [32]:
for batch in df.iter_slices(100_000):
    print(batch)

shape: (100_000, 14)
┌────────────┬───────────┬───────────┬───────────┬───┬─────────┬───────────┬───────────┬───────────┐
│ business_i ┆ name      ┆ address   ┆ city      ┆ … ┆ is_open ┆ attribute ┆ categorie ┆ hours     │
│ d          ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ---     ┆ s         ┆ s         ┆ ---       │
│ ---        ┆ str       ┆ str       ┆ str       ┆   ┆ i64     ┆ ---       ┆ ---       ┆ str       │
│ str        ┆           ┆           ┆           ┆   ┆         ┆ str       ┆ str       ┆           │
╞════════════╪═══════════╪═══════════╪═══════════╪═══╪═════════╪═══════════╪═══════════╪═══════════╡
│ Pns2l4eNsf ┆ Abby Rapp ┆ 1616      ┆ Santa     ┆ … ┆ 0       ┆ {'ByAppoi ┆ Doctors,  ┆ null      │
│ O8kk83dixA ┆ oport,    ┆ Chapala   ┆ Barbara   ┆   ┆         ┆ ntmentOnl ┆ Tradition ┆           │
│ 6A         ┆ LAC, CMQ  ┆ St, Ste 2 ┆           ┆   ┆         ┆ y':       ┆ al        ┆           │
│            ┆           ┆           ┆           ┆   ┆         ┆ 'True

## <strong style="color:#5e17eb"> df.with_columns() & df.with_columns_seq() </strong>


- with_columns() → Add or modify columns.

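- with_columns_seq() → Like .select_seq(): same result as with_columns(), but the expressions run sequentially rather than in parallel. For step-dependent transformations, chain multiple with_columns() calls.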

In [34]:
dff = df.with_columns([
    (pl.col("stars") + 1).alias("stars_plus_1")
])


# III. <strong style="color:#5e17eb">  Summary </strong>


| Function                        | Use Case / Description                         | When & Why to Use                                                        |
| ------------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------ |
| df.select()                     | Select specific column(s)                      | Extract a subset of columns for faster processing                        |
| df.filter()                     | Filter rows based on conditions                | Focus on specific segments of data (e.g. open businesses with ≥ 4 stars) |
| df.with\_columns()              | Add or modify columns                          | Feature engineering, transforming values                                 |
| df.drop()                       | Remove column(s)                               | Reduce memory footprint, remove unwanted data                            |
| df\[\[i, j]]                    | Take rows by index positions                   | Quick sampling or slicing                                                |
| df.row(index)                   | Get a single row as a Python tuple             | Look up a specific record                                                |
| df.null\_count()                | Count of nulls in each column                  | Identify missing values                                                  |
| df.describe()                   | Summary stats (mean, std, min, max, etc.)      | EDA: quick overview of numeric columns                                   |
| df.estimated\_size()            | Estimate memory size of the DataFrame          | For RAM usage planning                                                   |
| df.select(pl.all().n\_unique()) | Exact number of unique values per column       | Helpful in categorical columns or key detection (df.n\_unique() alone counts unique rows) |
| df.select(pl.all().approx\_n\_unique()) | Approximate unique count (fast for large data) | Faster uniqueness estimate on big datasets                               |
| df.bottom\_k(5, by="col")       | Get rows with the lowest k values in column    | Useful for ranking tasks, outlier detection                              |
| df.top\_k(5, by="col")          | Get rows with the highest k values             | See best-performing businesses, top-rated, etc.                          |
| df.schema                       | Returns dict of column names and datatypes     | Validate structure before downstream tasks                               |
| df.columns                      | List of all column names                       | Use in loops or programmatic column selection                            |
| df.dtypes                       | Data types of each column                      | Check for type mismatches                                                |
| df.get\_column("col")           | Return one column as a Series                  | Useful for column-wise computation                                       |
| df.get\_column\_index("col")    | Return the integer index of a named column     | Needed for low-level indexing                                            |
| df.unique()                     | Return only unique rows                        | Use for deduplication                                                    |
| df.select(pl.corr("a", "b"))    | Correlation between two numeric columns        | Stars vs. review\_count, e.g. (df.corr() takes no column names and returns the full matrix) |
| df.mean(), df.max() etc.        | Aggregate calculations                         | Quick summaries                                                          |
| df.to\_pandas()                 | Convert to pandas DataFrame                    | When needed to use pandas-specific tools                                 |
| df.clone()                      | Cheap copy of a DataFrame (data is not copied) | To safely modify without affecting original                              |
| df.sort("col", descending=True) | Sort by a column                               | Rank data for display                                                    |
| df.unpivot()                    | Wide to long format (formerly melt())          | Reshape data for plotting                                                |
| df.explode("col")               | Explode list-like column into multiple rows    | Use for category splitting                                               |
| df.rename({"old": "new"})       | Rename columns                                 | Clean up for consistency                                                 |
| df.cast({"col": pl.Int64})      | Change data type                               | Convert float to int, string to date, etc.                               |
| df.is\_empty()                  | Check if DataFrame is empty                    | For guarding downstream logic                                            |
| df.find\_idx\_by\_name("col")   | Older name for get\_column\_index              | Deprecated in favor of get\_column\_index                                |
| df.drop\_nulls()                | Drop any rows with nulls                       | Simple data cleanup                                                      |
| df.hash\_rows()                 | Generate a hash value per row                  | For deduplication or join keys (hashes can collide)                      |
| df.sample(n=5)                  | Random sample of n rows                        | EDA, previewing                                                          |
| df.limit(n)                     | Return first n rows                            | Quick preview                                                            |
| df.equals(df2)                  | Test if two DataFrames are equal (formerly frame\_equal) | Unit testing or validation                                     |


<div style="text-align: center;">
  <h4 style="
    display: inline-block;
    color: #5e17eb;
    font-family: 'Segoe UI';
    border-left: 5px solid #729be8ff;
    background-color: #F8F9F9;
    padding: 10px 20px;
    border-radius: 5px;
    text-align: left;
  "><b>
    Thank You 💜
    </b>
  </h4>
</div>