
#1 Writing to a CSV with an unnecessary index:
Explanation: When saving a DataFrame to a CSV file, pandas by default writes the index as an extra first column with an empty header. When the file is read back into pandas, that column shows up as Unnamed: 0, which is confusing.

Solution: Set index=False when saving to CSV to leave the index out, or pass index_col=0 when reading the CSV so the first column is used as the index again.

In [1]:
#importing necessary libraries
import pandas as pd

In [17]:
#creating a sample dataframe
df = pd.DataFrame({'x':[1,2,3],'y':[5,6,7]})
df.index = ['a','b','c']

In [18]:
df

   x  y
a  1  5
b  2  6
c  3  7


In [19]:
df.to_csv('sample.csv')

When we read the file back, we get an extra unnamed column. This column is the index of the original DataFrame.

In [20]:
sample = pd.read_csv('sample.csv')
sample

   Unnamed: 0  x  y
0           a  1  5
1           b  2  6
2           c  3  7


To avoid this, we pass index=False when writing, so the index is not saved and no unnamed column appears.

In [21]:
df.to_csv('sample.csv',index=False)

Now the file no longer contains an unnamed column.

In [22]:
sample = pd.read_csv('sample.csv')
sample

   x  y
0  1  5
1  2  6
2  3  7
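
The other half of the solution, keeping the index and restoring it on read, can be sketched as follows. This is a minimal example, not part of the original notebook: it writes to an in-memory io.StringIO buffer instead of sample.csv (an assumption made so the example leaves no file behind), and the name restored is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [5, 6, 7]}, index=['a', 'b', 'c'])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)   # the index is written as the first column, with an empty header
buffer.seek(0)

# index_col=0 tells read_csv to use the first column as the index again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Use index=False when the index carries no information (a plain 0, 1, 2 range), and index_col=0 when the index labels are meaningful and should survive the round trip.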


#2 Using column names that include spaces:
Explanation: A column whose name contains a space cannot be accessed with dot syntax, because the name is not a valid Python identifier. Only bracket syntax such as df['a key'] works for it.

Solution: Use underscores instead of spaces in column names to improve readability and maintainability.

In [37]:
#creating a sample dataframe
df = pd.DataFrame({'x':[1,2,3],'y':[5,6,7],'a key':[8,9,10]})
df.index = ['a','b','c']

Here we can easily access the x column of the dataframe

In [38]:
df.x

a    1
b    2
c    3
Name: x, dtype: int64

Now let's try the same for the 'a key' column

In [26]:
df.a key

SyntaxError: invalid syntax (<ipython-input-26-6e406f468c3d>, line 1)

As we can see, the space in the name makes this a syntax error.

Now let's resolve this error by replacing ' ' with '_' in the column name.

*General note*: df.columns gives us the column names of the DataFrame.

In [39]:
print(df)

   x  y  a key
a  1  5      8
b  2  6      9
c  3  7     10


In [40]:
#creating a sample dataframe with underscores
df = pd.DataFrame({'x':[1,2,3],'y':[5,6,7],'a_key':[8,9,10]})
df.index = ['a','b','c']

Here we can easily access the column using dot syntax

In [41]:
df.a_key

a     8
b     9
c    10
Name: a_key, dtype: int64
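
When the DataFrame already exists (for example, it was loaded from a file), the columns can be renamed instead of recreating the data. The sketch below is one possible way to do that using the string methods on df.columns; the names a_key_values and renamed are illustrative and do not appear in the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [5, 6, 7], 'a key': [8, 9, 10]})

# Bracket syntax works even when the column name contains a space
a_key_values = df['a key']

# Replace spaces with underscores in every column name at once
renamed = df.copy()
renamed.columns = renamed.columns.str.replace(' ', '_', regex=False)

# Dot syntax now works on the renamed column
print(renamed.a_key)
```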

#3 Not leveraging the query method:
Explanation: The query() method allows you to filter a DataFrame using a more expressive and readable syntax compared to traditional boolean indexing.

Solution: Use the query() method for filtering when the query criteria become complex.


In [42]:
#creating a sample dataframe
df = pd.DataFrame({'First_name':['rahul','reena','riyaz'],'age':[50,48,79],'Gender':['male','female','male']})

The query() method is a powerful way to filter rows

In [44]:
df.query('age>40 and Gender=="male"')

   First_name  age  Gender
0       rahul   50    male
2       riyaz   79    male
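
For comparison, here is the same filter written with traditional boolean indexing next to the query() version. This is a small sketch added for illustration; the variable names mask_result and query_result are not from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'First_name': ['rahul', 'reena', 'riyaz'],
                   'age': [50, 48, 79],
                   'Gender': ['male', 'female', 'male']})

# Boolean indexing: each condition needs parentheses and the & operator
mask_result = df[(df['age'] > 40) & (df['Gender'] == 'male')]

# query(): the same condition as a single readable string
query_result = df.query('age > 40 and Gender == "male"')

# Both approaches select the same rows
print(mask_result.equals(query_result))
```

With only one condition the two forms are about equally readable; the query() form tends to pay off as more conditions are combined.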


#4 Using string methods to formulate query strings:
Explanation: Instead of manually creating query strings using string concatenation, you can directly reference variables within the query string.

Solution: Use the @ symbol to reference variables within query strings.


In [46]:
#let's take a variable and use it in the query with the @ symbol
age_threshold = 50
df.query('age>=@age_threshold')

   First_name  age  Gender
0       rahul   50    male
2       riyaz   79    male
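
To make the contrast concrete, the sketch below builds the same query once by string concatenation and once with @. The names built_by_hand and with_at are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'First_name': ['rahul', 'reena', 'riyaz'],
                   'age': [50, 48, 79],
                   'Gender': ['male', 'female', 'male']})

age_threshold = 50

# Manual approach: the value has to be converted and glued into the string
built_by_hand = df.query('age >= ' + str(age_threshold))

# With @, query() looks up the Python variable by name
with_at = df.query('age >= @age_threshold')

# Both queries return the same rows
print(built_by_hand.equals(with_at))
```

The concatenated version gets harder to write correctly once the value is a string, because the quotes must be added by hand; the @ form avoids that.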


#5 Using inplace=True:
Explanation: Passing inplace=True makes a method modify the original DataFrame and return None. This can lead to unexpected behavior, for example when the result is assigned to a variable or used in a method chain, and it is generally discouraged.

Solution: Instead of using inplace=True, assign the result to a new variable or explicitly overwrite the original DataFrame. This keeps the code clearer, more maintainable, and less error-prone.

Storing the result in a separate variable also preserves the old DataFrame, so if we later need to check a preprocessing step or go back to the unmodified data, it is still available.

In [50]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan],
        'Age': [25, np.nan, 35, 30],
        'Gender': ['Female', 'Male', np.nan, 'Male']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows with missing values in Age column
df_cleaned = df.dropna(subset = ['Age'])

print("\nDataFrame after dropping rows with missing values:")
print(df_cleaned)

Original DataFrame:
      Name   Age  Gender
0    Alice  25.0  Female
1      Bob   NaN    Male
2  Charlie  35.0     NaN
3      NaN  30.0    Male

DataFrame after dropping rows with missing values:
      Name   Age  Gender
0    Alice  25.0  Female
2  Charlie  35.0     NaN
3      NaN  30.0    Male
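
The cell above shows only the recommended style. For contrast, the following sketch shows what happens with inplace=True on a copy of similar data. It uses a smaller sample than the notebook, and the names in_place, result and cleaned are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, np.nan, 35]})

# inplace=True changes in_place itself and returns None
in_place = df.copy()
result = in_place.dropna(subset=['Age'], inplace=True)
print(result)          # None, not a DataFrame
print(len(in_place))   # the row with the missing Age is gone

# Recommended: assign the result and keep the original untouched
cleaned = df.dropna(subset=['Age'])
print(len(df), len(cleaned))
```

A common mistake is writing df = df.dropna(inplace=True), which replaces the DataFrame with None.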


*pandas is built on top of NumPy. A DataFrame can be converted to a NumPy array with to_numpy(), and NumPy's vectorized operations can then be applied to the whole array at once, which is usually much faster than looping over rows in Python.*

In [1]:
import pandas as pd
import numpy as np

In [9]:
df = pd.DataFrame({'A':[1,2,3,4]})

In [10]:
nump = df.to_numpy()

In [11]:
nump

array([[1],
       [2],
       [3],
       [4]])

In [12]:
nump = nump+1

In [13]:
nump

array([[2],
       [3],
       [4],
       [5]])
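
Note that nump + 1 creates a new array, so df itself is unchanged. A minimal sketch of writing NumPy results back into the DataFrame is shown below; the column names A_plus_one and A_sqrt are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]})

# Pull one column out as a 1-D NumPy array
values = df['A'].to_numpy()

# Vectorized operations act on every element at once, with no Python loop
df['A_plus_one'] = values + 1
df['A_sqrt'] = np.sqrt(values)

print(df)
```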