Projects from Dataquest

This repository shares some of my projects from Dataquest. Each project I completed is listed below, along with notes on the concepts and methods used.

Project_01: Numbers of Births

This introductory project practices reading in a file, splitting data, creating a dictionary, and summarizing counts.
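
A minimal sketch of the counting pattern this project practices, assuming a hypothetical births.csv file with a header row and the month in the second column:

```python
# Read the file, skip the header, and tally births per month in a dictionary.
counts = {}
with open("births.csv") as f:
    rows = f.read().split("\n")[1:]  # drop the header row

for row in rows:
    if not row:
        continue  # skip any trailing empty line
    month = row.split(",")[1]
    counts[month] = counts.get(month, 0) + 1

print(counts)
```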

Project_02: Births

This project continues with the births dataset and uses dictionaries, loops, creating functions, and calculating some summary statistics.

Project_03: Gun Deaths

This project explores data on gun deaths and uses datetime, dictionaries, and loops to draw conclusions.

Project_04: eBay Car Sales

This project investigates car sales on Germany's eBay website and uses NumPy and Pandas to summarize and clean the data. Examples of methods used include info, head, describe, value_counts, and unique.
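
As a sketch of that exploration workflow, with stand-in rows and hypothetical column names in place of the actual listings file:

```python
import pandas as pd

# Stand-in data; the real project reads the eBay listings from a CSV.
autos = pd.DataFrame({
    "brand": ["volkswagen", "bmw", "volkswagen", "audi"],
    "price": [5000, 12000, 3200, 15000],
})

autos.info()                          # column dtypes and non-null counts
print(autos.head())                   # first rows of the data
print(autos.describe())               # summary statistics for numeric columns
print(autos["brand"].value_counts())  # frequency of each brand
print(autos["brand"].unique())        # distinct brand values
```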

Project_05: Earnings by College Major

This project looks at earnings by college major and explores different visualizations using Pandas and Matplotlib. Examples of visualizations used include scatter plots, histograms, scatter matrix plots, bar graphs, grouped histograms, box plots, and hexagonal bin plots.
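
A sketch of a few of those Pandas plotting calls, with stand-in data and hypothetical column names in place of the majors dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data with hypothetical column names.
majors = pd.DataFrame({
    "Major": ["ENGINEERING", "BIOLOGY", "ENGLISH", "NURSING"],
    "Median": [57000, 36000, 32000, 48000],
    "ShareWomen": [0.12, 0.60, 0.68, 0.90],
})

majors.plot(x="ShareWomen", y="Median", kind="scatter")            # scatter plot
majors["Median"].plot(kind="hist")                                 # histogram
pd.plotting.scatter_matrix(majors[["Median", "ShareWomen"]])       # scatter matrix
majors.plot(x="Major", y="Median", kind="bar")                     # bar graph
majors["Median"].plot(kind="box")                                  # box plot
majors.plot(x="ShareWomen", y="Median", kind="hexbin", gridsize=5) # hexagonal bins
plt.show()
```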

Project_06: Degrees Awarded by Gender

This project looks at degrees awarded by gender and builds more customized visualizations. Examples include using options to change the size, layout, labels, tick marks, colors, and other display parameters of the visualizations.

Project_07: NYC High Schools

This project incorporates data on New York City high schools. The data is initially provided in multiple files, which I combine, converting columns to numeric types using regex as needed. Data visualizations used include scatter plots, bar graphs, and a map using basemap.
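
A small sketch of the regex-to-numeric step, using a hypothetical "Location 1" column that embeds coordinates in parentheses:

```python
import re
import pandas as pd

# Stand-in rows; the hypothetical column ends in "(latitude, longitude)".
schools = pd.DataFrame(
    {"Location 1": ["123 Main St\n(40.8276, -73.9044)", "bad value"]}
)

def extract_lat(loc):
    # Pull the latitude out of the parenthesized coordinate pair.
    match = re.search(r"\(([-\d.]+),\s*([-\d.]+)\)", loc)
    return match.group(1) if match else None

schools["lat"] = pd.to_numeric(schools["Location 1"].apply(extract_lat),
                               errors="coerce")
print(schools["lat"])
```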

Project_08: Star Wars Survey

This project investigates survey data on Star Wars movies. I cleaned and transformed the provided data into a usable format with usable columns. I then compiled various statistics using NumPy and Pandas and created visualizations using bar graphs from Matplotlib and Seaborn.

Project_09: Command Line Analysis

This project involved creating and manipulating files from the command line rather than in Jupyter. I created several scripts to read in the data, count some variables, create pivot tables, and build a dictionary of counts. The results of the analysis were printed to the findings.txt file.

Project_10: Hacker News Submissions

This project continued my work from the command line, writing scripts to answer questions about data on submissions to Hacker News. I created scripts to read in the data, determine the most common words in the headlines, determine which domains were submitted most frequently, and determine at what time of day submissions were most commonly made.
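
A sketch of the word-count script, assuming a hypothetical hn_stories.csv file with the headline in the second column:

```python
import csv
from collections import Counter

# Tally every lowercased word across all headlines.
words = Counter()
with open("hn_stories.csv") as f:
    for row in csv.reader(f):
        for word in row[1].lower().split():
            words[word] += 1

print(words.most_common(20))  # the 20 most frequent headline words
```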

Project_11: CIA World Factbook

This project looks at the CIA World Factbook database using SQLite. I created SELECT queries to review the data, look for outliers, and calculate several different statistics. I also created visualizations using Matplotlib and Seaborn.
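
A sketch of the query pattern, assuming a factbook.db file containing a facts table with name, population, and area_land columns:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("factbook.db")

# Population density, highest first; CAST forces floating-point division.
query = """
SELECT name,
       population,
       CAST(population AS FLOAT) / area_land AS density
FROM facts
WHERE population IS NOT NULL
ORDER BY density DESC
LIMIT 10;
"""
print(pd.read_sql_query(query, conn))
conn.close()
```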

Project_12: Chinook Music Store

This project uses the chinook database of sales for a digital media store. I used SQLite to run queries on the database and Matplotlib to create visualizations of the results. I use the results of the queries and visualizations to make business recommendations regarding which types of albums to add to the store, the performance of the different sales agents, which countries generate sales, and whether sales should be of individual songs or full albums.

Project_13: Baseball Database

This project uses SQLite to design and create a database of Major League Baseball statistics initially provided in several different files. I review the data provided, design the schema for the database, and then build the tables as outlined in that schema.
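
A sketch of the table-creation step, with hypothetical table and column names standing in for the normalized baseball schema:

```python
import sqlite3

conn = sqlite3.connect("mlb.db")

# Parent table: one row per team.
conn.execute("""
CREATE TABLE IF NOT EXISTS team (
    team_id TEXT PRIMARY KEY,
    league_id TEXT,
    city TEXT,
    nickname TEXT
);
""")

# Child table: each game references its home team by foreign key.
conn.execute("""
CREATE TABLE IF NOT EXISTS game (
    game_id TEXT PRIMARY KEY,
    date TEXT,
    home_team_id TEXT,
    FOREIGN KEY (home_team_id) REFERENCES team(team_id)
);
""")

conn.commit()
conn.close()
```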

Project_14: Fandango Movie Ratings

This project looks at whether Fandango movie ratings have changed since it was discovered that they were rounding up ratings. Using Pandas, NumPy, and Matplotlib, I investigate the differences in ratings before and after this discovery with calculated statistics and visualizations, including kernel density plots and grouped bar plots.
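
A sketch of the kernel density comparison, with stand-in ratings in place of the actual Fandango data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in ratings for the two periods being compared.
before = pd.Series([4.5, 4.0, 5.0, 3.5, 4.5])  # pre-discovery ratings
after = pd.Series([4.0, 3.5, 4.5, 3.0, 4.0])   # post-discovery ratings

# Overlay the two kernel density estimates on one axis.
before.plot.kde(label="before")
after.plot.kde(label="after")
plt.xlabel("Rating")
plt.legend()
plt.show()
```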

Project_15: E-learning Markets

This project looks at an e-learning company to determine the best markets in which to advertise. I use NumPy, Pandas, and Matplotlib to compute different metrics, normalize the values, and create visualizations to make business recommendations. I also handle missing values and outliers in the data.

Project_16: Jeopardy!

This project uses a portion of the Jeopardy! dataset to look for patterns to help win the game. I use regular expressions to clean text strings, write functions to determine whether the answer can be deduced from the question, and determine whether new questions are recycled old questions. Finally, I used chi-squared tests to determine if any of the terms had a statistically significant difference between high and low value questions.
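
A sketch of the chi-squared step with scipy, using hypothetical counts of how often one term appears in high- versus low-value questions:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts for one term: [high-value, low-value].
observed = np.array([12, 40])
total = observed.sum()

# Hypothetical overall split of high- vs low-value questions.
high_share, low_share = 0.2, 0.8
expected = np.array([total * high_share, total * low_share])

# Compare observed counts against the chance expectation.
stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)
```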

Project_17: Predicting Car Prices

This project uses a dataset on cars where I investigate using k-nearest neighbors to predict the market price of a car. I use KNeighborsRegressor, mean_squared_error, cross_val_score, and KFold in the analysis. I also create visualizations of the calculated RMSEs, determine the best features to use in the model, and tune the value of k in k-nearest neighbors.
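
A sketch of the tuning loop, with random stand-in data in place of the cleaned cars dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold

# Random stand-in features and prices.
rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = rng.random(200) * 30000

# 10-fold cross validation, trying several values of k.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsRegressor(n_neighbors=k)
    mses = cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=kf)
    rmse = np.mean(np.sqrt(np.abs(mses)))  # average RMSE across folds
    print(k, round(rmse, 2))
```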

Project_18: Ames Housing Prices

This project looks at housing data for Ames, Iowa, performing several data preparation tasks and applying different linear regression techniques. I create a pipeline of functions to transform the features, select specific features for the model, and train and test the model. While creating these functions, I investigate the data more closely, deal with missing values, look at correlations, and create a heat map. I also use holdout validation, simple cross validation, and k-fold cross validation.

Project_19: S&P 500 Stock Prices

This project evaluates historical S&P 500 stock price data. I write a script to predict closing prices, in which I engineer several features for the models to use, train a linear regression model, and evaluate the model based on root mean squared error.

Project_20: Bike Rentals

This project uses data on bike rentals in Washington, DC. I use several different models to predict the count of bike rentals in a given hour. I review the data and create visualizations, including a histogram of rental counts and a heatmap of correlations. I begin with linear regression. Next, I use decision trees to improve significantly over linear regression, and I tweak the model to make it perform even better. Then, I use a random forest to improve further over the best decision tree model and tune its parameters to make it perform even better.
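
A sketch of the random forest step, with random stand-in data in place of the bike rentals features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Random stand-in features (hour, temperature, etc.) and rental counts.
rng = np.random.default_rng(1)
X = rng.random((500, 5))
y = rng.integers(0, 500, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# min_samples_leaf is one of the parameters worth tuning here.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=2,
                              random_state=1)
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(round(rmse, 2))
```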

Project_21: Handwritten Digits

This project uses the sklearn load_digits dataset to build models to classify handwritten digits. I review the data provided, create visualizations of the images, and train a k-nearest neighbors model to start. I then build several variations of neural network models to evaluate how they perform as the number of layers and the number of neurons per layer increase.
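
A sketch of the model comparison on the digits dataset; the layer sizes here are illustrative, not the exact configurations from the project:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=1)

# Baseline: k-nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("knn:", knn.score(X_test, y_test))

# Neural networks with increasing depth and width.
for layers in [(64,), (64, 64), (128, 128)]:
    nn = MLPClassifier(hidden_layer_sizes=layers, max_iter=1000,
                       random_state=1).fit(X_train, y_train)
    print(layers, nn.score(X_test, y_test))
```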

Project_22: Airplane Accidents

This project uses data on airplane accidents. I write a script that uses this data to work with different data structures (lists and dictionaries) and to evaluate the time complexity of the code. The script also uses try/except to handle missing or invalid values.
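
A sketch of the try/except pattern for missing or invalid values, using hypothetical rows from the accidents file:

```python
# Hypothetical (year, fatality count) rows; some counts are unusable.
rows = [["2001", "12"], ["2002", ""], ["2003", "n/a"]]

fatalities = {}
for year, count in rows:
    try:
        fatalities[year] = int(count)
    except ValueError:
        # Skip rows where the count is missing or not a number.
        continue

print(fatalities)
```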

Project_23: Titanic

This project uses the Titanic dataset from Kaggle to practice creating a workflow. I build a pipeline of functions to perform data exploration, feature selection, and model selection and tuning, and finally submit a predictions file to Kaggle. My score for this submission on Kaggle was 0.77990.