In [1]:
# # Data Science with Python

# Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
# and systems to extract knowledge and insights from structured and unstructured data. It draws on
# techniques and tools from statistics, computer science, information theory, and domain-specific knowledge.
# The key components, common tools, and typical applications are outlined below.

# Key Components of Data Science

# Data Collection and Acquisition:

'''
Gathering data from various sources such as databases, APIs, web scraping, surveys, and more.
Ensuring data quality and relevance to the problem at hand.

'''

# Data Cleaning and Preprocessing:

'''
Handling missing values, removing duplicates, and correcting errors.
Normalizing or scaling data.
Transforming data into a suitable format for analysis.

'''
# Exploratory Data Analysis (EDA):

'''
Summarizing main characteristics of the data using statistical measures.
Visualizing data through plots and charts to identify patterns, trends, and anomalies.

'''
# Feature Engineering:

'''
Creating new features or modifying existing ones to improve model performance.
Selecting the most relevant features for the analysis.

'''
# Data Modeling:

'''
Applying various algorithms to build predictive or descriptive models.
Techniques include regression, classification, clustering, time-series analysis, and more.

'''
# Model Evaluation:

'''
Assessing the model’s performance using metrics such as accuracy, precision, recall, F1-score, ROC-AUC, etc.
Using cross-validation and held-out test sets to check that the model generalizes to unseen data.

'''
# Model Deployment:

'''
Implementing the model in a production environment where it can serve predictions, in real time or in batches.
Monitoring the model’s performance over time and updating it as needed.

'''
# Communication and Visualization:

'''
Presenting findings and insights through dashboards, reports, and visualizations.
Translating technical results into actionable business insights.

'''

# Tools and Technologies in Data Science

# Programming Languages:

'''

Python: Widely used for its simplicity and powerful libraries like NumPy, pandas, scikit-learn, TensorFlow, and PyTorch.
R: Popular for statistical analysis and data visualization.

'''
# Data Manipulation and Analysis:

'''
pandas and NumPy (Python)
dplyr and tidyverse (R)
'''

# Data Visualization:

'''
Matplotlib, Seaborn, and Plotly (Python)
ggplot2 (R)

'''

# Machine Learning and Deep Learning:

'''
scikit-learn, TensorFlow, Keras, PyTorch (Python)
caret and mlr (R)

'''

# Big Data Technologies:

'''
Apache Hadoop, Spark, Kafka
'''

# Database Management:

'''
SQL (relational) databases and NoSQL databases such as MongoDB and Cassandra
'''
# Cloud Platforms:

'''
AWS, Google Cloud Platform (GCP), Microsoft Azure
'''

# Collaboration and Version Control:

'''
Git, GitHub, Jupyter Notebooks

'''

# Applications of Data Science

'''

Business Analytics: Optimizing operations, improving customer experience, and driving strategic decisions.

Healthcare: Predicting disease outbreaks, personalized medicine, and improving patient care.

Finance: Fraud detection, risk management, and algorithmic trading.

Marketing: Customer segmentation, sentiment analysis, and targeted advertising.

E-commerce: Recommendation systems, inventory management, and sales forecasting.

Social Media: Analyzing user behavior, content recommendation, and sentiment analysis.

Government and Public Policy: Policy analysis, public health, and crime prediction.


'''
# Conclusion

# Data science is a powerful tool that can be applied across various industries to solve complex problems and 
# make data-driven decisions. Its interdisciplinary nature requires a blend of skills in statistics, programming, 
# and domain expertise, making it a dynamic and evolving field.


In [2]:
%pip install seaborn


Note: you may need to restart the kernel to use updated packages.


In [3]:
# Step 1: Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot, not in matplotlib itself
import seaborn as sns


In [9]:
# Step 2: Load the dataset

df = pd.read_csv('niceguys.csv')

print(df.to_string())

               name sex  age  living_allowance                   course  number_of_siblings  meals_per_day has_boyfriend_girlfriend grade  attendance_rate part-time_job
0      Alice Fenton   F   20               300         Computer Science                   1              3                      Yes     A               95            No
1         Bob Smith   M   22               250   Mechanical Engineering                   2              2                       No     B               90           Yes
2       Cathy Brown   F   19               400              Mathematics                   0              3                       No     A               98            No
3         David Lee   M   21               350  Business Administration                   3              2                      Yes     B               85           Yes
4        Emma Jones   F   18               200               Psychology                   1              3                       No     A               92 
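
Printing the whole frame with `df.to_string()` becomes unwieldy as the data grows. Below is a minimal sketch of a lighter first look. It assumes only the four complete rows shown in the output above; `niceguys.csv` itself is not reproduced here, so `csv_text` and `sample` are stand-ins introduced for illustration.

```python
import io

import pandas as pd

# The four complete rows copied from the output above; the fifth row is
# truncated there, so it is left out. This stands in for niceguys.csv.
csv_text = """name,sex,age,living_allowance,course,number_of_siblings,meals_per_day,has_boyfriend_girlfriend,grade,attendance_rate,part-time_job
Alice Fenton,F,20,300,Computer Science,1,3,Yes,A,95,No
Bob Smith,M,22,250,Mechanical Engineering,2,2,No,B,90,Yes
Cathy Brown,F,19,400,Mathematics,0,3,No,A,98,No
David Lee,M,21,350,Business Administration,3,2,Yes,B,85,Yes
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (rows, columns)
print(sample.head(3))     # first rows only, instead of the full table
print(sample.dtypes)      # which columns were parsed as numbers vs. text
print(sample.describe())  # summary statistics for the numeric columns
```

On the real data the same four calls can be made on `df` directly.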

In [5]:
# Step 3: Data cleaning and preprocessing

# Check for missing values

print(df.isnull().sum())

name                        0
sex                         0
age                         0
living_allowance            0
course                      0
number_of_siblings          0
meals_per_day               0
has_boyfriend_girlfriend    0
grade                       0
attendance_rate             0
part-time_job               0
dtype: int64
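
The missing-value check can be extended to cover duplicate rows, which the cleaning step also mentions. Below is a sketch using a hypothetical helper, `quality_report` (not part of pandas), on a toy frame whose values are invented for illustration.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages (hypothetical helper)."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    })


# Toy data, invented for this example: one missing value, one duplicated row.
toy = pd.DataFrame({
    "age": [20, 22, 22, 19],
    "living_allowance": [300.0, 250.0, 250.0, np.nan],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(report)
print("duplicate rows:", n_duplicates)

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```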


In [6]:
# Filling missing values
# The check above found no missing values, so this fill is only a safeguard here.
# The result is assigned back to the column: calling fillna(..., inplace=True) on
# df['living_allowance'] acts on an intermediate copy, and pandas warns that this
# will never update df from pandas 3.0 onward.

df['living_allowance'] = df['living_allowance'].fillna(df['living_allowance'].median())
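
Because the earlier check found no missing values, the fill has no visible effect on this dataset. To see it work, the sketch below blanks out one value on purpose. The allowance figures are taken from the rows shown in the output; the `NaN` is introduced artificially, and `demo` is a name made up for this example.

```python
import numpy as np
import pandas as pd

# living_allowance values from the first three rows shown earlier, plus one
# deliberately missing value (NaN) added for demonstration.
demo = pd.DataFrame({"living_allowance": [300.0, 250.0, 400.0, np.nan]})

# median() skips NaN by default, so this is the median of 300, 250 and 400.
median_allowance = demo["living_allowance"].median()

# Assign the filled column back instead of using inplace=True on a selection.
demo["living_allowance"] = demo["living_allowance"].fillna(median_allowance)

print(median_allowance)
print(demo)
```

The median is a common choice for quantities such as allowances because it is less sensitive to outliers than the mean.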


In [7]:
# Step 4: Exploratory data analysis (EDA)
# Planned analyses:
#   - distribution of students by age and by whether they have a boyfriend or girlfriend
#   - boyfriend/girlfriend status by gender
#   - correlation matrix, plotted as a heatmap
# Insights and conclusions to draw from the data, based on:
#   - age vs. boyfriend/girlfriend status
#   - gender vs. boyfriend/girlfriend status
#   - the correlation analysis
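
The notebook stops at the plan, so here is one way the analyses listed above might be sketched. It assumes the four complete rows from the earlier output as stand-in data. The names `eda`, `in_relationship`, and `plot_eda` are introduced here for illustration, and encoding Yes/No as 1/0 is a choice made for this example rather than something the notebook prescribes.

```python
import io

import pandas as pd

# Four complete rows from the output shown earlier (a stand-in for
# niceguys.csv; only the columns used below are included).
csv_text = """sex,age,living_allowance,number_of_siblings,meals_per_day,has_boyfriend_girlfriend,attendance_rate
F,20,300,1,3,Yes,95
M,22,250,2,2,No,90
F,19,400,0,3,No,98
M,21,350,3,2,Yes,85
"""
eda = pd.read_csv(io.StringIO(csv_text))

# Age summary, split by relationship status.
age_by_status = eda.groupby("has_boyfriend_girlfriend")["age"].agg(
    ["count", "mean", "min", "max"]
)

# Counts of relationship status within each gender.
status_by_sex = pd.crosstab(eda["sex"], eda["has_boyfriend_girlfriend"])

# Correlation matrix over the numeric columns, plus relationship status
# encoded as 1/0 (an encoding chosen here for illustration).
numeric = eda.select_dtypes("number").assign(
    in_relationship=(eda["has_boyfriend_girlfriend"] == "Yes").astype(int)
)
corr = numeric.corr()

print(age_by_status, status_by_sex, corr.round(2), sep="\n\n")


def plot_eda(frame: pd.DataFrame, corr_matrix: pd.DataFrame) -> None:
    """Draw the three planned plots (hypothetical helper).

    matplotlib and seaborn are imported inside the function so the tables
    above can be computed even where the plotting libraries are missing.
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    sns.histplot(data=frame, x="age", hue="has_boyfriend_girlfriend",
                 multiple="dodge", discrete=True, ax=axes[0])
    sns.countplot(data=frame, x="sex", hue="has_boyfriend_girlfriend", ax=axes[1])
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=axes[2])
    fig.tight_layout()
    plt.show()
```

With only a handful of rows the correlations carry no real meaning; they are shown to illustrate the mechanics. On the full dataset, the same steps can be run on `df`, and `plot_eda(df, corr)` would then draw the age distribution, the gender counts, and the heatmap side by side.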

In [8]:
import sys

print(sys.version)

3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
