BabesGotByte/DATA-MINING-Algorithms

DATA-MINING-Algorithms

Algorithms Discussed:

We have discussed the following algorithms:

  • Apriori algorithm
  • Decision tree ID3 algorithm
  • FP Growth Algorithm
  • Bayesian Classification Algorithm
  • Web Crawling Problem
  • KNN Algorithm
  • Linear Regression with One variable
  • Linear Regression with Multiple Variables
  • Support Vector Machine Model
  • BIRCH Algorithm
  • DBSCAN Algorithm
  • K-Means Algorithm
  • PAM Algorithm
  • Decision tree using C4.5 and CART algorithm
  • Hierarchical Clustering Algorithm
  • OPTICS Algorithm
  • Face Detection Algorithm
  • Perceptron Algorithm

Problem Statements

Algorithm-1:

Dataset used: weather.csv

Perform the following operations on the weather dataset using Pandas.

  • Read the dataset into a DataFrame.
  • Drop rows with missing ("NaN") values.
  • Drop columns with missing ("NaN") values.
  • Fill the "NaN" values with the mean and with the median.
  • Split the dataset row-wise and column-wise.
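The operations listed above can be sketched with pandas as follows. The column names and values are a hypothetical stand-in, since the layout of weather.csv is not described here; for the real file, `pd.read_csv("weather.csv")` would replace the in-memory read.

```python
import io
import pandas as pd

# Hypothetical stand-in for weather.csv (assumed columns, made-up values).
CSV_TEXT = """day,temperature,humidity
1,30.0,70.0
2,,65.0
3,28.0,
4,35.0,75.0
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

rows_dropped = df.dropna(axis=0)   # drop every row that contains a NaN
cols_dropped = df.dropna(axis=1)   # drop every column that contains a NaN

# Fill NaN values column by column with that column's mean or median.
filled_mean = df.fillna(df.mean(numeric_only=True))
filled_median = df.fillna(df.median(numeric_only=True))

# Row-wise split (first two rows / the rest) and column-wise split.
top_rows, bottom_rows = df.iloc[:2], df.iloc[2:]
left_cols, right_cols = df.iloc[:, :2], df.iloc[:, 2:]
```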

Algorithm-2:

Dataset used: data folder (chess.dat, mushroom.txt, retail.dat, FILE1.txt, FILE2.txt)

Implement the Apriori algorithm for association rules. Run the algorithm with two different support and confidence levels of your own choosing. (The Chess, Mushroom and Retail datasets can be used.)

  • Print the closed itemsets.
  • Print the closed frequent itemsets.

Note: Let I be the set of all items, Y ⊆ I and X ⊆ Y.
If X is an infrequent itemset, then Y is also an infrequent itemset. Apply the Apriori algorithm on that basis.
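The pruning rule in the note can be illustrated with a minimal Apriori sketch. The transactions are a made-up toy example, not taken from the repository's data files, and support is counted as an absolute number of transactions (an assumption). The closedness check only looks at frequent supersets, so it yields closed frequent itemsets.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate with an infrequent (k-1)-subset cannot be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def closed_itemsets(frequent):
    """Keep the frequent itemsets that have no proper superset with equal support."""
    return {s: c for s, c in frequent.items()
            if not any(s < t and c == ct for t, ct in frequent.items())}

# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent = apriori(transactions, min_support=3)
closed = closed_itemsets(frequent)
```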

Algorithm-3:

Dataset used: car.data.txt

Implement decision tree ID3 algorithm for the given dataset for Car Evaluation Database.

  • Attribute Information: Six input attributes: buying, maint, doors, persons, lug_boot, safety
  • Class Values: unacc, acc, good, vgood
  • Attributes:
    ∗ buying: vhigh, high, med, low.
    ∗ maint: vhigh, high, med, low.
    ∗ doors: 2, 3, 4, 5more.
    ∗ persons: 2, 4, more.
    ∗ lug_boot: small, med, big.
    ∗ safety: low, med, high.
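As a rough illustration of ID3 on categorical attributes like those above, the sketch below picks the attribute with the highest information gain at each node. The six rows are hypothetical and use only two of the attributes; they are not taken from car.data.txt.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure node: return the class
        return labels[0]
    if not attrs:                      # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

# Hypothetical rows (made up for illustration).
rows = [
    {"safety": "low", "persons": "2", "class": "unacc"},
    {"safety": "low", "persons": "4", "class": "unacc"},
    {"safety": "high", "persons": "2", "class": "unacc"},
    {"safety": "high", "persons": "4", "class": "acc"},
    {"safety": "med", "persons": "4", "class": "acc"},
    {"safety": "med", "persons": "2", "class": "unacc"},
]
tree = id3(rows, ["safety", "persons"], "class")
```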

Algorithm-4:

Dataset used: Online Retail.xlsx

Implement the FP-Growth algorithm on the given dataset.

Algorithm-5(i):

Dataset used: DATASET.xlsx

Using Bayesian classification, predict the class (Target wait) for the following sample.
X = (alt = T, Bat = T, Fri = F, Hun = T, Pat = Some, Price = $$$, Rain = T, Res = T, Type = Italian, Est > 60).

Algorithm-5(ii):

The task is a web crawling problem.

  • Write a program to fetch the web page http://en.wikipedia.org/wiki/India.
  • Count the number of hyperlinks on this page.
  • Assign a unique number to each link.
  • Select one of the links found and repeat the first three steps on it.
  • Repeat the above steps at least two times and generate an adjacency matrix.
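One way to sketch the steps above with only the standard library is shown below. To keep it runnable offline, `fetch` reads from a small hypothetical in-memory site (`PAGES`) instead of the network, and the rule for choosing the next link (the first link not yet crawled) is an assumption.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links

def crawl(start, fetch, repeats=2):
    ids = {start: 0}          # unique number for each URL seen
    edges = set()
    visited = []
    current = start
    for _ in range(repeats + 1):
        visited.append(current)
        links = extract_links(fetch(current))
        for url in links:
            ids.setdefault(url, len(ids))
            edges.add((ids[current], ids[url]))
        # Assumed selection rule: follow the first link not crawled yet.
        current = next((u for u in links if u not in visited), None)
        if current is None:
            break
    n = len(ids)
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1      # page i links to page j
    return ids, matrix

# Hypothetical three-page site used instead of a live request.
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/a">A</a> <a href="/c">C</a>',
    "/c": '<a href="/a">A</a>',
}
ids, matrix = crawl("/a", PAGES.__getitem__)
```

For a live page, `fetch` could be built on `urllib.request.urlopen`, and relative links would need resolving with `urllib.parse.urljoin` before they are numbered.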

Algorithm-6(i):

Dataset used: DATASET.xlsx

Using Bayesian classification, predict the class (Target wait) for the following sample.
X = (alt = T, Bat = T, Fri = F, Hun = T, Pat = Some, Price = $$$, Rain = T, Res = T, Type = Italian, Est > 60).

Algorithm-6(ii):

Dataset used: data_sheet.xlsx

Predict a class label using naïve Bayesian classification for the tuple:
X = {age = “<= 30”, income = “medium”, student = “yes”, credit rating = “fair”}
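A naïve Bayes prediction for a categorical tuple like the one above can be sketched as follows. The training rows, the target name `buys` and the restriction to two attributes are hypothetical, since the contents of data_sheet.xlsx are not shown here; add-one (Laplace) smoothing is also an assumption.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    domains = defaultdict(set)            # attribute -> set of observed values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
                domains[attr].add(value)
    return class_counts, value_counts, domains

def predict(model, sample):
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total                 # prior P(class)
        for attr, value in sample.items():
            # P(value | class) with add-one smoothing (an assumption).
            score *= (value_counts[(attr, cls)][value] + 1) / (n + len(domains[attr]))
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical training rows (made up for illustration).
rows = [
    {"age": "<=30", "student": "no", "buys": "no"},
    {"age": "<=30", "student": "no", "buys": "no"},
    {"age": "31..40", "student": "no", "buys": "yes"},
    {"age": ">40", "student": "yes", "buys": "yes"},
    {"age": "<=30", "student": "yes", "buys": "yes"},
    {"age": ">40", "student": "no", "buys": "no"},
]
model = train_naive_bayes(rows, "buys")
label, scores = predict(model, {"age": "<=30", "student": "yes"})
```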

Algorithm-7:

Dataset used: iris-dataset.csv , iris-test.csv

Implement the KNN algorithm for classification in Python using the following instructions:

  • The Iris data set is bundled for testing; however, you are free to use any data set of your choice, provided that it follows the format below.
  • Data set format:
    ∗ Attributes can be integer or real values.
    ∗ List attributes first, and add the response as the last value in each row.
    ∗ E.g. [4.5, 7, 2.6, "Orange"], where the first 3 numbers are attribute values and "Orange" is one of the response classes.
    ∗ Another example is [1.2, 4.3, 3]; in this case there are 2 attributes and the response class is the integer 3.
    ∗ Responses can be integer, real or categorical.
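With rows in the format described above (attributes first, response last), a minimal KNN classifier might look like this. Euclidean distance and majority voting are assumptions, and the training rows are made up.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` (attributes only) by majority vote of its k nearest rows."""
    neighbours = sorted(train, key=lambda row: math.dist(row[:-1], query))[:k]
    votes = Counter(row[-1] for row in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training rows: two attributes, then the response class.
train = [
    [1.0, 1.0, "A"], [1.2, 0.8, "A"], [0.9, 1.1, "A"],
    [5.0, 5.0, "B"], [5.2, 4.9, "B"], [4.8, 5.1, "B"],
]
```

For example, `knn_predict(train, [1.1, 1.0])` votes among the three rows closest to the query and returns their most common response.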

Algorithm-8(i):

Dataset used: ex1data1.txt

Implement linear regression with one variable to predict profits for a food truck.
Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next. The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.
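A minimal sketch of one-variable linear regression fitted by batch gradient descent is given below. The learning rate, iteration count and the five data points are assumptions chosen for illustration; they are not values from ex1data1.txt.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y ≈ theta0 + theta1 * x by minimising the mean squared error."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical noiseless data lying on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
theta0, theta1 = gradient_descent(xs, ys)
```

With the real file, the first column would supply `xs` (population) and the second `ys` (profit).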

Algorithm-8(ii):

Dataset used: ex1data2.txt

Implement linear regression with multiple variables to predict the prices of houses.
Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
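Because house size and bedroom count differ by orders of magnitude, a common approach is to normalise each feature before running gradient descent. The sketch below assumes that approach; the four rows follow the column layout described above (size, bedrooms, price) but the values are made up.

```python
import statistics

def normalise(columns):
    """Scale each feature column to zero mean and unit (population) standard deviation."""
    stats = [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]
    scaled = [[(v - mu) / sigma for v in c] for c, (mu, sigma) in zip(columns, stats)]
    return scaled, stats

def fit(rows, ys, alpha=0.1, iterations=1000):
    X = [[1.0] + list(r) for r in rows]     # prepend the intercept term
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(iterations):
        errors = [sum(t * x for t, x in zip(theta, row)) - y for row, y in zip(X, ys)]
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

# Hypothetical rows: size (square feet), bedrooms, price.
data = [[1000, 2, 170000], [2000, 2, 270000], [1000, 4, 190000], [2000, 4, 290000]]
sizes, beds, prices = zip(*data)
scaled_cols, stats = normalise([sizes, beds])
theta = fit(list(zip(*scaled_cols)), prices)

def predict(size, bedrooms):
    # New inputs must be scaled with the training mean and standard deviation.
    z = [(v - mu) / sigma for v, (mu, sigma) in zip((size, bedrooms), stats)]
    return theta[0] + sum(t * v for t, v in zip(theta[1:], z))
```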

Algorithm-9(i):

Dataset used: data.xlsx

Write a program to train a linear SVM using the dataset given in the file data.xlsx and test it using some unseen data. (Do not use an SVM library function.)
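One library-free way to train a linear SVM is sub-gradient descent on the L2-regularised hinge loss, sketched below. This is one possible approach, not necessarily the one used in the repository; the hyperparameters and the six labelled points are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Labels must be +1 or -1. Returns the weight vector and bias."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: step along the hinge-loss sub-gradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regulariser acts on w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable training data.
X = [[-2, -1], [-1, -2], [-1, -1], [2, 1], [1, 2], [1, 1]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

Unseen points such as `[3, 3]` or `[-3, -2]` can then be passed to `predict(w, b, x)`.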

Algorithm-9(ii):

Dataset used: ionosphere.data

Train an SVM on the ionosphere dataset. Divide the dataset into training and testing sets and report the accuracy of the SVM.

Algorithm-10:

Dataset used: Dataset.txt

Perform the BIRCH algorithm for the dataset.

Algorithm-11:

Dataset used: Dataset.txt

Perform the DBSCAN algorithm for the dataset.
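A compact DBSCAN sketch is shown below. The points, `eps` and `min_pts` are hypothetical, since the contents of Dataset.txt are not described; noise is labelled -1 and a point counts itself among its neighbours (both common conventions, assumed here).

```python
import math

def dbscan(points, eps, min_pts):
    NOISE = -1
    labels = [None] * len(points)          # None means "not visited yet"

    def region(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours) # j is a core point: keep expanding
        cluster += 1
    return labels

# Hypothetical 2-D points: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```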

Algorithm-12(i):

Dataset used: Absenteeism_at_work.xls

Perform the K-Means algorithm for the dataset.
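A minimal K-Means (Lloyd's algorithm) sketch follows. Using the first k points as initial centroids is a simplification chosen to keep the example deterministic, and the points are made up rather than taken from Absenteeism_at_work.xls.

```python
import math

def kmeans(points, k, max_iter=100):
    centroids = [tuple(p) for p in points[:k]]   # simplistic initialisation (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break                                 # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                           # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centroids = kmeans(points, k=2)
```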

Algorithm-12(ii):

Dataset used: Absenteeism_at_work.xls

Perform the PAM algorithm for the dataset.
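The swap phase of PAM (k-medoids) can be sketched as below. This is a simplified variant: it skips PAM's BUILD phase by starting from the first k points and accepts the first improving swap it finds, both of which are assumptions. The points are hypothetical.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(range(k))               # indices of the current medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-medoid point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best

# Hypothetical 2-D points.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, labels, cost = pam(points, k=2)
```

Unlike K-Means centroids, the returned medoids are always actual data points, which is the main design difference between the two methods.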

Algorithm-13:

Dataset used: car.data.txt

Implement decision trees using the C4.5 and CART algorithms for the Car Evaluation Dataset.

  • Attribute Information: Six input attributes: buying, maint, doors, persons, lug_boot, safety
  • Class Values: unacc, acc, good, vgood
  • Attributes:
    ∗ buying: vhigh, high, med, low.
    ∗ maint: vhigh, high, med, low.
    ∗ doors: 2, 3, 4, 5more.
    ∗ persons: 2, 4, more.
    ∗ lug_boot: small, med, big.
    ∗ safety: low, med, high.
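The two algorithms differ mainly in their split criterion: C4.5 ranks attributes by gain ratio, while CART scores binary splits by Gini impurity. The sketch below computes both criteria only (it does not build the trees), on hypothetical rows that use two of the attributes listed above.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """C4.5 criterion: information gain divided by the split information."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - remainder
    split_info = entropy([r[attr] for r in rows])
    return gain / split_info if split_info > 0 else 0.0

def gini_of_binary_split(rows, attr, value, target):
    """CART criterion: weighted Gini impurity of the split attr == value / attr != value."""
    left = [r[target] for r in rows if r[attr] == value]
    right = [r[target] for r in rows if r[attr] != value]
    n = len(rows)
    return sum(len(side) / n * gini(side) for side in (left, right) if side)

# Hypothetical rows (made up for illustration).
rows = [
    {"safety": "low", "persons": "2", "class": "unacc"},
    {"safety": "low", "persons": "4", "class": "unacc"},
    {"safety": "high", "persons": "2", "class": "unacc"},
    {"safety": "high", "persons": "4", "class": "acc"},
    {"safety": "med", "persons": "4", "class": "acc"},
    {"safety": "med", "persons": "2", "class": "unacc"},
]
```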

Algorithm-14:

Dataset used: qla.csv , matrix.xlsx

Implement the hierarchical clustering algorithm and apply it to the qla dataset. Also, show the resulting dendrograms after applying the average linkage approach.
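Agglomerative clustering with average linkage can be sketched as follows. The function returns the merge history (the two cluster ids merged and the distance at which they merged), which is the information a dendrogram is drawn from; plotting is left out. The 1-D points are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(points):
    clusters = {i: [i] for i in range(len(points))}   # cluster id -> member indices
    merges = []
    next_id = len(points)

    def dist(a, b):
        # Average pairwise distance between the members of clusters a and b.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Hypothetical 1-D points, written as 1-tuples.
points = [(0,), (1,), (5,), (6.5,), (20,)]
merges = average_linkage(points)
```

Each new cluster receives the next unused id, so ids 5 and above in `merges` refer to clusters formed by earlier merges.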

Algorithm-15:

Dataset used: data1.xlsx, data2.xlsx

Implement OPTICS algorithm and apply it on datasets (for this epsilon = 0.02, minPts = 500) and output each point's reachability distance, core distance and order of points in the reachability graph.
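The three requested outputs (processing order, reachability distance and core distance) can be produced by a sketch like the one below. It assumes `min_pts >= 2` and counts the point itself toward `min_pts`; the toy points and parameters are made up and differ from the epsilon and minPts values given above.

```python
import math

def optics(points, eps, min_pts):
    n = len(points)
    reach = [math.inf] * n        # reachability distance (inf = undefined)
    core = [math.inf] * n         # core distance (inf = not a core point)
    processed = [False] * n
    order = []

    def neighbours(i):
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        # The point itself counts, so min_pts - 1 other neighbours are needed.
        if len(nbrs) < min_pts - 1:
            return math.inf
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 2]

    for start in range(n):
        if processed[start]:
            continue
        seeds = {start}
        while seeds:
            # Process the seed with the smallest reachability distance next.
            i = min(seeds, key=lambda j: (reach[j], j))
            seeds.remove(i)
            processed[i] = True
            order.append(i)
            nbrs = neighbours(i)
            core[i] = core_distance(i, nbrs)
            if core[i] == math.inf:
                continue
            for j in nbrs:
                if not processed[j]:
                    new_reach = max(core[i], math.dist(points[i], points[j]))
                    if new_reach < reach[j]:
                        reach[j] = new_reach
                        seeds.add(j)
    return order, reach, core

# Hypothetical 1-D points forming two groups.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
order, reach, core = optics(points, eps=3, min_pts=2)
```

Plotting `reach` in the sequence given by `order` yields the reachability graph; the undefined (infinite) values mark where a new group starts.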

Algorithm-16:

Dataset used: Dataset Manual.txt
Code: PCA Folder

Implement a face detection algorithm using Principal Component Analysis (PCA).

Algorithm-17:

Dataset used: dataset

Implement a face detection algorithm using Linear Discriminant Analysis (LDA).

Algorithm-18:

Dataset used: none (the algorithm takes its input from the user).

Implement the perceptron algorithm.
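A minimal perceptron sketch with a step activation is shown below, trained on the logical AND function. The choice of AND as training data, the learning rate and the epoch limit are assumptions made for illustration.

```python
def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def train_perceptron(X, y, lr=1.0, epochs=20):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            delta = target - predict(w, b, xi)
            if delta:
                # Perceptron rule: move the boundary toward the misclassified point.
                w = [wj + lr * delta * xj for wj, xj in zip(w, xi)]
                b += lr * delta
                errors += 1
        if errors == 0:       # a full pass without mistakes: converged
            break
    return w, b

# Hypothetical training set: the AND function, which is linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
```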
