# Python Data Science Reference Guide

A comprehensive reference guide covering the complete data science pipeline - from raw data to production models. It provides examples and best practices for understanding data and utilizing Python libraries to import, explore, analyze, develop models, and evaluate them effectively. Each section below outlines key concepts and links to illustrative Jupyter Notebooks.

## 1. Importing Data Sets

This section focuses on how to bring data into your Python environment from various sources using popular libraries like pandas.

Learn how to import data from different file formats and sources, handle basic loading parameters, and perform initial inspections of the imported data.

Key Operations and Code Examples:

| Package/Method | Description | Code Example |
| --- | --- | --- |
| `pd.read_csv()` | Read a CSV file into a pandas DataFrame. | `df = pd.read_csv(<CSV_path>, header=None)  # load without header`<br>`df = pd.read_csv(<CSV_path>, header=0)  # use first row as header` |
| `df.head(n)` | Print the first `n` rows of the DataFrame (default 5). | `df.head(n)` |
| `df.tail(n)` | Print the last `n` rows of the DataFrame (default 5). | `df.tail(n)` |
| `df.columns = headers` | Assign appropriate header names to the DataFrame. | `df.columns = headers` |
| `df.replace("?", np.nan)` | Replace specific values (e.g., `"?"`) with NumPy's `NaN` (Not a Number) for handling missing data. | `df = df.replace("?", np.nan)` |
| `df.dtypes` | Retrieve the data types of each column in the DataFrame. | `df.dtypes` |
| `df.describe()` | Generate descriptive statistics of the DataFrame. By default, it analyzes numerical columns; use `include="all"` to include all data types. | `df.describe()`<br>`df.describe(include="all")` |
| `df.info()` | Provide a concise summary of the DataFrame, including data types, non-null counts, and memory usage. | `df.info()` |
| `df.to_csv(<output CSV path>)` | Save the processed DataFrame to a CSV file at the specified path. | `df.to_csv(<output CSV path>)` |

Note: In JupyterLite, download the file locally and use the local path. In other environments, you can pass a URL directly to `pd.read_csv()`.
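Putting the pieces together, here is a minimal runnable sketch of a typical import workflow. The file name `data.csv` and the header list are hypothetical placeholders; substitute your own path and column names.

```python
import numpy as np
import pandas as pd

# Hypothetical path and headers -- replace with your own.
csv_path = "data.csv"
headers = ["make", "horsepower", "price"]

# Load without a header row, then assign column names.
df = pd.read_csv(csv_path, header=None)
df.columns = headers

# Replace "?" placeholders with NaN so pandas treats them as missing.
df = df.replace("?", np.nan)

# Initial inspection.
print(df.head())                    # first 5 rows
print(df.dtypes)                    # data type of each column
print(df.describe(include="all"))   # stats for all columns
df.info()                           # non-null counts and memory usage

# Save the processed DataFrame.
df.to_csv("data_clean.csv", index=False)
```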

Example Notebook:

## 2. Data Wrangling

This section covers the essential steps involved in cleaning, transforming, and structuring your data for effective analysis.
These fundamental pre-processing tasks include handling missing values, formatting data to make it consistent, normalizing values, grouping data into bins, and converting categorical variables into numerical indicator variables, all to ensure data quality and consistency.

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Replace Missing Data with Frequency | Replace missing values (`NaN`) in a column with the most frequently occurring value (the mode). | `MostFrequentEntry = df['attribute_name'].value_counts().idxmax()`<br>`df['attribute_name'] = df['attribute_name'].replace(np.nan, MostFrequentEntry)` |
| Replace Missing Data with Mean | Replace missing values (`NaN`) in a column with the mean of the non-missing values. | `AverageValue = df['attribute_name'].astype(<data_type>).mean(axis=0)`<br>`df['attribute_name'] = df['attribute_name'].replace(np.nan, AverageValue)` |
| Fix Data Types | Change the data type of one or more columns in the DataFrame. | `df[['attribute1_name', 'attribute2_name', ...]] = df[['attribute1_name', 'attribute2_name', ...]].astype('data_type')  # data_type can be int, float, str, etc.` |
| Data Normalization | Scale the values in a column to the range 0 to 1 by dividing by the maximum. | `df['attribute_name'] = df['attribute_name'] / df['attribute_name'].max()` |
| Binning | Group data in a column into discrete intervals (bins) for analysis and visualization. | `bins = np.linspace(min(df['attribute_name']), max(df['attribute_name']), n)  # n edges produce n-1 bins`<br>`GroupNames = ['Group1', 'Group2', 'Group3', ...]`<br>`df['binned_attribute_name'] = pd.cut(df['attribute_name'], bins, labels=GroupNames, include_lowest=True)` |
| Change Column Name | Rename a specified column in the DataFrame. | `df.rename(columns={'old_name': 'new_name'}, inplace=True)` |
| Indicator Variables | Create new binary (0 or 1) columns for each unique category in a categorical column. | `dummy_variable = pd.get_dummies(df['attribute_name'])`<br>`df = pd.concat([df, dummy_variable], axis=1)` |

Note: Assigning the result back to the column (rather than calling `replace(..., inplace=True)` on a column slice) avoids pandas chained-assignment warnings.
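A short, runnable sketch of these wrangling steps on a hypothetical DataFrame; the column names (`fuel`, `price`, `horsepower`) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing entries.
df = pd.DataFrame({
    "fuel": ["gas", "diesel", np.nan, "gas"],
    "price": ["13495", "16500", np.nan, "13950"],
    "horsepower": [111, np.nan, 154, 102],
})

# Impute: most frequent value for a categorical, mean for a numerical column.
most_frequent = df["fuel"].value_counts().idxmax()
df["fuel"] = df["fuel"].replace(np.nan, most_frequent)
df["horsepower"] = df["horsepower"].replace(np.nan, df["horsepower"].mean())

# Fix types: price arrives as strings, convert to float, then impute.
df["price"] = df["price"].astype(float)
df["price"] = df["price"].replace(np.nan, df["price"].mean())

# Normalize horsepower to [0, 1] by simple max scaling.
df["horsepower"] = df["horsepower"] / df["horsepower"].max()

# Bin price into 3 groups (4 edges -> 3 bins).
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
df["price_binned"] = pd.cut(df["price"], bins,
                            labels=["Low", "Medium", "High"],
                            include_lowest=True)

# Indicator (dummy) variables for the fuel type.
df = pd.concat([df, pd.get_dummies(df["fuel"])], axis=1)
print(df)
```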

Example Notebook:

## 3. Exploratory Data Analysis (EDA)

This section focuses on techniques for visually and statistically exploring your data to uncover patterns, relationships, and insights.

Discover how to use various plotting libraries (like Matplotlib and Seaborn) and statistical methods to understand the distribution of variables, identify correlations, and detect potential outliers.

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Complete dataframe correlation | Correlation matrix over the numerical attributes of the dataset (recent pandas versions require `numeric_only=True` if the frame contains non-numeric columns). | `df.corr()` |
| Specific Attribute correlation | Correlation matrix over selected attributes of the dataset. | `df[['attribute1','attribute2',...]].corr()` |
| Scatter Plot | Create a scatter plot with the independent variable on the x-axis and the dependent variable on the y-axis. | `from matplotlib import pyplot as plt`<br>`plt.scatter(df['attribute_1'], df['attribute_2'])` |
| Regression Plot | Use the independent and dependent variables in a pandas DataFrame to create a scatter plot with a fitted linear regression line. | `sns.regplot(x='attribute_1', y='attribute_2', data=df)` |
| Box plot | Create a box-and-whisker plot of a numerical variable grouped by a categorical variable. | `sns.boxplot(x='attribute_1', y='attribute_2', data=df)` |
| Grouping by attributes | Select a subset of attributes from the dataset to work with. | `df_group = df[['attribute_1','attribute_2',...]]` |
| GroupBy statements | a. Group the data by the categories of one attribute, displaying the mean of the numerical attributes per category.<br>b. Group the data by the categories of multiple attributes. | `df_group = df.groupby(['attribute_1'], as_index=False).mean()`<br>`df_group = df.groupby(['attribute_1','attribute_2'], as_index=False).mean()` |
| Pivot Tables | Reshape grouped data into a pivot table, with one attribute as rows and another as columns. | `grouped_pivot = df_group.pivot(index='attribute_1', columns='attribute_2')` |
| Pseudocolor plot | Render the pivot table as a heatmap using a pseudocolor (pcolor) plot. | `from matplotlib import pyplot as plt`<br>`plt.pcolor(grouped_pivot, cmap='RdBu')` |
| Pearson Coefficient and p-value | Calculate the Pearson correlation coefficient and p-value for a pair of attributes. | `from scipy import stats`<br>`pearson_coef, p_value = stats.pearsonr(df['attribute_1'], df['attribute_2'])` |
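A minimal runnable EDA sketch combining these operations. The dataset and its columns (`body_style`, `drive_wheels`, `horsepower`, `price`) are hypothetical; adapt them to your data.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats

# Hypothetical dataset -- replace with your own columns.
df = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "hatchback", "wagon", "wagon"],
    "drive_wheels": ["fwd", "rwd", "fwd", "rwd", "fwd", "rwd"],
    "horsepower": [95, 140, 70, 110, 88, 120],
    "price": [13000, 21000, 8000, 16000, 11000, 18000],
})

# Correlation of the numerical columns.
print(df.corr(numeric_only=True))

# Scatter plot with a fitted regression line: horsepower (x) vs price (y).
sns.regplot(x="horsepower", y="price", data=df)
plt.show()

# Group by two categorical attributes, average price, then pivot.
df_group = df.groupby(["drive_wheels", "body_style"], as_index=False)["price"].mean()
grouped_pivot = df_group.pivot(index="drive_wheels", columns="body_style")

# Heatmap of the pivot table.
plt.pcolor(grouped_pivot, cmap="RdBu")
plt.colorbar()
plt.show()

# Strength and significance of the horsepower-price relationship.
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print(f"Pearson coefficient: {pearson_coef:.3f}, p-value: {p_value:.3f}")
```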

Example Notebook:

## 4. Model Development

This section introduces the process of building predictive models using Python's powerful machine learning libraries (like scikit-learn).

Learn the fundamental steps in model development, including selecting appropriate algorithms, training models on your data, and making predictions.

| Process | Description | Code Example |
| --- | --- | --- |
| Linear Regression | Create a Linear Regression model object. | `from sklearn.linear_model import LinearRegression`<br>`lr = LinearRegression()` |
| Train Linear Regression model | Train the model on the chosen data, separating input and output attributes. A single input attribute gives simple linear regression; multiple attributes give multiple linear regression. | `X = df[['attribute_1', 'attribute_2', ...]]`<br>`Y = df['target_attribute']`<br>`lr.fit(X, Y)` |
| Generate output predictions | Predict the output for a set of input attribute values. | `Y_hat = lr.predict(X)` |
| Identify the coefficient and intercept | Retrieve the slope coefficient(s) and intercept of the fitted linear regression model. | `coeff = lr.coef_`<br>`intercept = lr.intercept_` |
| Residual Plot | Regress y on x and draw a scatter plot of the residuals. | `import seaborn as sns`<br>`sns.residplot(x=df['attribute_1'], y=df['attribute_2'])` |
| Distribution Plot | Plot the distribution of data for a given attribute. Note: `sns.distplot()` is deprecated in recent seaborn releases; `sns.kdeplot()` is the equivalent of `distplot(..., hist=False)`. | `import seaborn as sns`<br>`sns.kdeplot(df['attribute_name'])` |
| Polynomial Regression | Fit a single-variable polynomial model using NumPy. | `f = np.polyfit(x, y, n)`<br>`p = np.poly1d(f)`<br>`Y_hat = p(x)` |
| Multi-variate Polynomial Regression | Generate a new feature matrix consisting of all polynomial combinations of the features with degree ≤ the specified degree. | `from sklearn.preprocessing import PolynomialFeatures`<br>`Z = df[['attribute_1', 'attribute_2', ...]]`<br>`pr = PolynomialFeatures(degree=n)`<br>`Z_pr = pr.fit_transform(Z)` |
| Pipeline | Simplify data processing by chaining transformations and a model into a single estimator. | `from sklearn.pipeline import Pipeline`<br>`from sklearn.preprocessing import StandardScaler`<br>`Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]`<br>`pipe = Pipeline(Input)`<br>`Z = Z.astype(float)`<br>`pipe.fit(Z, y)`<br>`ypipe = pipe.predict(Z)` |
| R² value | R² (coefficient of determination) measures the proportion of variation in the target explained by the fitted model.<br>a. For Linear Regression<br>b. For Polynomial Regression | `# a)`<br>`X = df[['attribute_1', 'attribute_2', ...]]`<br>`Y = df['target_attribute']`<br>`lr.fit(X, Y)`<br>`R2_score = lr.score(X, Y)`<br>`# b)`<br>`from sklearn.metrics import r2_score`<br>`f = np.polyfit(x, y, n)`<br>`p = np.poly1d(f)`<br>`R2_score = r2_score(y, p(x))` |
| MSE value | The Mean Squared Error (MSE) is the average of the squared differences between actual and predicted values. | `from sklearn.metrics import mean_squared_error`<br>`mse = mean_squared_error(Y, Y_hat)` |
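A runnable sketch of the modeling workflow, from a multiple linear regression to a polynomial pipeline. The synthetic data and column names (`horsepower`, `curb_weight`, `price`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: price as a noisy linear function of two features.
rng = np.random.default_rng(0)
df = pd.DataFrame({"horsepower": rng.uniform(50, 200, 100),
                   "curb_weight": rng.uniform(1500, 4000, 100)})
df["price"] = 80 * df["horsepower"] + 5 * df["curb_weight"] + rng.normal(0, 500, 100)

X = df[["horsepower", "curb_weight"]]
Y = df["price"]

# Multiple linear regression: fit, predict, and evaluate.
lr = LinearRegression()
lr.fit(X, Y)
Y_hat = lr.predict(X)
print("coefficients:", lr.coef_, "intercept:", lr.intercept_)
print("R^2:", r2_score(Y, Y_hat), "MSE:", mean_squared_error(Y, Y_hat))

# Polynomial regression via a pipeline: scale -> polynomial features -> linear model.
pipe = Pipeline([("scale", StandardScaler()),
                 ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
                 ("model", LinearRegression())])
pipe.fit(X, Y)
print("pipeline R^2:", pipe.score(X, Y))
```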

Example Notebook:

## 5. Model Refinement and Evaluation

This section focuses on assessing the performance of your models and exploring techniques for improving their accuracy and generalization.

Understand various evaluation metrics for different types of models and explore methods for hyperparameter tuning and model selection.

| Process | Description | Code Example |
| --- | --- | --- |
| Splitting data for training and testing | Separate the target attribute from the predictors, then split the input and output datasets into training and testing subsets. | `from sklearn.model_selection import train_test_split`<br>`y_data = df['target_attribute']`<br>`x_data = df.drop('target_attribute', axis=1)`<br>`x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)` |
| Cross validation score | When data is limited, use cross-validation to evaluate model performance across multiple folds (for regressors, the default score is R²). | `from sklearn.model_selection import cross_val_score`<br>`from sklearn.linear_model import LinearRegression`<br>`lre = LinearRegression()`<br>`Rcross = cross_val_score(lre, x_data[['attribute_1']], y_data, cv=n)`<br>`Mean = Rcross.mean()`<br>`Std_dev = Rcross.std()` |
| Cross validation prediction | Generate out-of-fold predictions with a cross-validated model. | `from sklearn.model_selection import cross_val_predict`<br>`from sklearn.linear_model import LinearRegression`<br>`lre = LinearRegression()`<br>`yhat = cross_val_predict(lre, x_data[['attribute_1']], y_data, cv=4)` |
| Ridge Regression and Prediction | Use Ridge regression with a regularization strength `alpha` to reduce overfitting in polynomial models. Note that the polynomial transformer is fitted on the training data only and then applied to the test data. | `from sklearn.linear_model import Ridge`<br>`pr = PolynomialFeatures(degree=2)`<br>`x_train_pr = pr.fit_transform(x_train[['attribute_1', 'attribute_2', ...]])`<br>`x_test_pr = pr.transform(x_test[['attribute_1', 'attribute_2', ...]])`<br>`RidgeModel = Ridge(alpha=1)`<br>`RidgeModel.fit(x_train_pr, y_train)`<br>`yhat = RidgeModel.predict(x_test_pr)` |
| Grid Search | Use grid search with cross-validation to find the best `alpha` value for Ridge regression. | `from sklearn.model_selection import GridSearchCV`<br>`from sklearn.linear_model import Ridge`<br>`parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, ...]}]`<br>`RR = Ridge()`<br>`Grid1 = GridSearchCV(RR, parameters, cv=4)`<br>`Grid1.fit(x_data[['attribute_1', 'attribute_2', ...]], y_data)`<br>`BestRR = Grid1.best_estimator_`<br>`BestRR.score(x_test[['attribute_1', 'attribute_2', ...]], y_test)` |
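A runnable end-to-end refinement sketch: hold-out split, cross-validation, polynomial features, and grid-searched Ridge regression. The synthetic data and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data -- replace with your own.
rng = np.random.default_rng(1)
df = pd.DataFrame({"horsepower": rng.uniform(50, 200, 200),
                   "curb_weight": rng.uniform(1500, 4000, 200)})
df["price"] = 80 * df["horsepower"] + 5 * df["curb_weight"] + rng.normal(0, 500, 200)

# Hold out 10% of the data for final testing.
y_data = df["price"]
x_data = df.drop("price", axis=1)
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.10, random_state=1)

# 4-fold cross-validated R^2 on the training set.
scores = cross_val_score(Ridge(alpha=1), x_train, y_train, cv=4)
print(f"CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

# Polynomial features: fit on training data only, then apply to test data.
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)

# Grid search over alpha with 4-fold cross-validation.
grid = GridSearchCV(Ridge(), {"alpha": [0.001, 0.1, 1, 10, 100, 1000]}, cv=4)
grid.fit(x_train_pr, y_train)
print("best alpha:", grid.best_params_["alpha"])
print("test R^2:", grid.best_estimator_.score(x_test_pr, y_test))
```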

Example Notebook:

## 6. One More Complete Example

Example Notebook:
